Method and device for efficient frame erasure concealment in linear predictive based speech codecs

ABSTRACT

The present invention relates to a method and device for improving concealment of frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder ( 106 ) to a decoder ( 110 ), and for accelerating recovery of the decoder after non erased frames of the encoded sound signal have been received. For that purpose, concealment/recovery parameters are determined in the encoder or decoder. When determined in the encoder ( 106 ), the concealment/recovery parameters are transmitted to the decoder ( 110 ). In the decoder, erasure frame concealment and decoder recovery is conducted in response to the concealment/recovery parameters. The concealment/recovery parameters may be selected from the group consisting of: a signal classification parameter, an energy information parameter and a phase information parameter. The determination of the concealment/recovery parameters comprises classifying the successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset, and this classification is determined on the basis of at least a part of the following parameters: a normalized correlation parameter, a spectral tilt parameter, a signal-to-noise ratio parameter, a pitch stability parameter, a relative frame energy parameter, and a zero crossing parameter.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is the national phase of International (PCT) Patent Application Serial No. PCT/CA03/00830, filed May 30, 2003, published under PCT Article 21(2) in English, which claims priority to and the benefit of Canadian Patent Application No. 2,388,439, filed May 31, 2002, the disclosures of which are incorporated herein by reference.

FIELD OF THE INVENTION

The present invention relates to a technique for digitally encoding a sound signal, in particular but not exclusively a speech signal, in view of transmitting and/or synthesizing this sound signal. More specifically, the present invention relates to robust encoding and decoding of sound signals to maintain good performance in case of erased frame(s) due, for example, to channel errors in wireless systems or lost packets in voice over packet network applications.

BACKGROUND OF THE INVENTION

The demand for efficient digital narrow- and wideband speech encoding techniques with a good trade-off between subjective quality and bit rate is increasing in various application areas such as teleconferencing, multimedia, and wireless communications. Until recently, a telephone bandwidth constrained into a range of 200-3400 Hz has mainly been used in speech coding applications. However, wideband speech applications provide increased intelligibility and naturalness in communication compared to the conventional telephone bandwidth. A bandwidth in the range of 50-7000 Hz has been found sufficient for delivering good quality, giving an impression of face-to-face communication. For general audio signals, this bandwidth gives an acceptable subjective quality, but is still lower than the quality of FM radio or CD, which operate on ranges of 20-16000 Hz and 20-20000 Hz, respectively.

A speech encoder converts a speech signal into a digital bit stream which is transmitted over a communication channel or stored in a storage medium. The speech signal is digitized, that is, sampled and quantized, usually with 16 bits per sample. The speech encoder has the role of representing these digital samples with a smaller number of bits while maintaining a good subjective speech quality. The speech decoder or synthesizer operates on the transmitted or stored bit stream and converts it back to a sound signal.

Code-Excited Linear Prediction (CELP) coding is one of the best available techniques for achieving a good compromise between subjective quality and bit rate. This encoding technique is the basis of several speech encoding standards in both wireless and wireline applications. In CELP encoding, the sampled speech signal is processed in successive blocks of L samples usually called frames, where L is a predetermined number corresponding typically to 10-30 ms. A linear prediction (LP) filter is computed and transmitted every frame. The computation of the LP filter typically needs a lookahead, a 5-15 ms speech segment from the subsequent frame. The L-sample frame is divided into smaller blocks called subframes. Usually the number of subframes is three or four, resulting in 4-10 ms subframes. In each subframe, an excitation signal is usually obtained from two components: the past excitation and the innovative, fixed-codebook excitation. The component formed from the past excitation is often referred to as the adaptive codebook or pitch excitation. The parameters characterizing the excitation signal are coded and transmitted to the decoder, where the reconstructed excitation signal is used as the input of the LP filter.

As the main applications of low bit rate speech encoding are wireless mobile communication systems and voice over packet networks, increasing the robustness of speech codecs in case of frame erasures becomes of significant importance. In wireless cellular systems, the energy of the received signal can exhibit frequent severe fades resulting in high bit error rates, and this becomes more evident at the cell boundaries. In this case the channel decoder fails to correct the errors in the received frame and, as a consequence, the error detector usually used after the channel decoder will declare the frame as erased. In voice over packet network applications, the speech signal is packetized, usually with a 20 ms frame placed in each packet. In packet-switched communications, a packet can be dropped at a router if the number of packets becomes very large, or the packet can reach the receiver after a long delay, in which case it should be declared as lost if its delay exceeds the length of the jitter buffer at the receiver side. In these systems, the codec is typically subjected to 3 to 5% frame erasure rates. Furthermore, the use of wideband speech encoding is an important asset to these systems, allowing them to compete with the traditional PSTN (public switched telephone network) that uses legacy narrowband speech signals.

The adaptive codebook, or pitch predictor, in CELP plays an important role in maintaining high speech quality at low bit rates. However, since the content of the adaptive codebook is based on the signal from past frames, the codec model is sensitive to frame loss. In case of erased or lost frames, the content of the adaptive codebook at the decoder becomes different from its content at the encoder. Thus, after a lost frame is concealed and subsequent good frames are received, the synthesized signal in the received good frames is different from the intended synthesis signal, since the adaptive codebook contribution has changed. The impact of a lost frame depends on the nature of the speech segment in which the erasure occurred. If the erasure occurs in a stationary segment of the signal, then efficient frame erasure concealment can be performed and the impact on subsequent good frames can be minimized. On the other hand, if the erasure occurs in a speech onset or a transition, the effect of the erasure can propagate through several frames. For instance, if the beginning of a voiced segment is lost, then the first pitch period will be missing from the adaptive codebook content. This will have a severe effect on the pitch predictor in subsequent good frames, resulting in a long time before the synthesis signal converges to the intended one at the encoder.

SUMMARY OF THE INVENTION

The present invention relates to a method for improving concealment of frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, and for accelerating recovery of the decoder after non erased frames of the encoded sound signal have been received, comprising:

determining, in the encoder, concealment/recovery parameters;

transmitting to the decoder the concealment/recovery parameters determined in the encoder; and

in the decoder, conducting erasure frame concealment and decoder recovery in response to the received concealment/recovery parameters.

The present invention also relates to a method for the concealment of frame erasure caused by frames erased during transmission of a sound signal encoded under the form of signal-encoding parameters from an encoder to a decoder, and for accelerating recovery of the decoder after non erased frames of the encoded sound signal have been received, comprising:

determining, in the decoder, concealment/recovery parameters from the signal-encoding parameters;

in the decoder, conducting erased frame concealment and decoder recovery in response to the determined concealment/recovery parameters.

In accordance with the present invention, there is also provided a device for improving concealment of frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, and for accelerating recovery of the decoder after non erased frames of the encoded sound signal have been received, comprising:

means for determining, in the encoder, concealment/recovery parameters;

means for transmitting to the decoder the concealment/recovery parameters determined in the encoder; and

in the decoder, means for conducting erasure frame concealment and decoder recovery in response to the received concealment/recovery parameters.

According to the invention, there is further provided a device for the concealment of frame erasure caused by frames erased during transmission of a sound signal encoded under the form of signal-encoding parameters from an encoder to a decoder, and for accelerating recovery of the decoder after non erased frames of the encoded sound signal have been received, comprising:

means for determining, in the decoder, concealment/recovery parameters from the signal-encoding parameters;

in the decoder, means for conducting erased frame concealment and decoder recovery in response to the determined concealment/recovery parameters.

The present invention is also concerned with a system for encoding and decoding a sound signal, and a sound signal decoder, using the above defined devices for improving concealment of frame erasure caused by frames of the encoded sound signal erased during transmission from the encoder to the decoder, and for accelerating recovery of the decoder after non erased frames of the encoded sound signal have been received.

The foregoing and other objects, advantages and features of the present invention will become more apparent upon reading of the following non-restrictive description of illustrative embodiments thereof, given by way of example only with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic block diagram of a speech communication system illustrating an application of speech encoding and decoding devices in accordance with the present invention;

FIG. 2 is a schematic block diagram of an example of wideband encoding device (AMR-WB encoder);

FIG. 3 is a schematic block diagram of an example of wideband decoding device (AMR-WB decoder);

FIG. 4 is a simplified block diagram of the AMR-WB encoder of FIG. 2, wherein the down-sampler module, the high-pass filter module and the pre-emphasis filter module have been grouped in a single pre-processing module, and wherein the closed-loop pitch search module, the zero-input response calculator module, the impulse response generator module, the innovative excitation search module and the memory update module have been grouped in a single closed-loop pitch and innovative codebook search module;

FIG. 5 is an extension of the block diagram of FIG. 4 in which modules related to an illustrative embodiment of the present invention have been added;

FIG. 6 is a block diagram explaining the situation when an artificial onset is constructed; and

FIG. 7 is a schematic diagram showing an illustrative embodiment of a frame classification state machine for the erasure concealment.

DETAILED DESCRIPTION OF THE ILLUSTRATIVE EMBODIMENTS

Although the illustrative embodiments of the present invention will be described in the following description in relation to a speech signal, it should be kept in mind that the concepts of the present invention equally apply to other types of signals, in particular but not exclusively to other types of sound signals.

FIG. 1 illustrates a speech communication system 100 depicting the use of speech encoding and decoding in the context of the present invention. The speech communication system 100 of FIG. 1 supports transmission of a speech signal across a communication channel 101. Although it may comprise, for example, a wire, an optical link or a fiber link, the communication channel 101 typically comprises at least in part a radio frequency link. The radio frequency link often supports multiple, simultaneous speech communications requiring shared bandwidth resources, such as may be found with cellular telephony systems. Although not shown, the communication channel 101 may be replaced by a storage device, in a single-device embodiment of the system 100, that records and stores the encoded speech signal for later playback.

In the speech communication system 100 of FIG. 1, a microphone 102 produces an analog speech signal 103 that is supplied to an analog-to-digital (A/D) converter 104 for converting it into a digital speech signal 105. A speech encoder 106 encodes the digital speech signal 105 to produce a set of signal-encoding parameters 107 that are coded into binary form and delivered to a channel encoder 108. The optional channel encoder 108 adds redundancy to the binary representation of the signal-encoding parameters 107 before transmitting them over the communication channel 101.

In the receiver, a channel decoder 109 utilizes the said redundant information in the received bit stream 111 to detect and correct channel errors that occurred during the transmission. A speech decoder 110 converts the bit stream 112 received from the channel decoder 109 back to a set of signal-encoding parameters and creates from the recovered signal-encoding parameters a digital synthesized speech signal 113. The digital synthesized speech signal 113 reconstructed at the speech decoder 110 is converted to an analog form 114 by a digital-to-analog (D/A) converter 115 and played back through a loudspeaker unit 116.

The illustrative embodiment of the efficient frame erasure concealment method disclosed in the present specification can be used with either narrowband or wideband linear prediction based codecs. The present illustrative embodiment is disclosed in relation to a wideband speech codec that has been standardized by the International Telecommunication Union (ITU) as Recommendation G.722.2 and known as the AMR-WB codec (Adaptive Multi-Rate Wideband codec) [ITU-T Recommendation G.722.2 “Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB)”, Geneva, 2002]. This codec has also been selected by the third generation partnership project (3GPP) for wideband telephony in third generation wireless systems [3GPP TS 26.190, “AMR Wideband Speech Codec: Transcoding Functions,” 3GPP Technical Specification]. AMR-WB can operate at 9 bit rates ranging from 6.6 to 23.85 kbit/s. The bit rate of 12.65 kbit/s is used to illustrate the present invention.

Here, it should be understood that the illustrative embodiment of the efficient frame erasure concealment method could be applied to other types of codecs.

In the following sections, an overview of the AMR-WB encoder and decoder will first be given. Then, the illustrative embodiment of the novel approach to improve the robustness of the codec will be disclosed.

Overview of the AMR-WB Encoder

The sampled speech signal is encoded on a block-by-block basis by the encoding device 200 of FIG. 2, which is broken down into eleven modules numbered from 201 to 211.

The input speech signal 212 is therefore processed on a block-by-block basis, i.e. in the above-mentioned L-sample blocks called frames.

Referring to FIG. 2, the sampled input speech signal 212 is down-sampled in a down-sampler module 201. The signal is down-sampled from 16 kHz down to 12.8 kHz, using techniques well known to those of ordinary skill in the art. Down-sampling increases the coding efficiency, since a smaller frequency bandwidth is encoded. This also reduces the algorithmic complexity since the number of samples in a frame is decreased. After down-sampling, the 320-sample frame of 20 ms is reduced to a 256-sample frame (down-sampling ratio of 4/5).
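As an illustration of this step, the following sketch performs the 4/5-ratio rate conversion with a generic polyphase resampler; it does not reproduce the exact AMR-WB decimation filter:

```python
import numpy as np
from scipy.signal import resample_poly

def downsample_16k_to_12k8(x: np.ndarray) -> np.ndarray:
    """Convert 16 kHz speech to 12.8 kHz (ratio 4/5) with a polyphase
    resampler; a generic sketch, not the standardized AMR-WB filter."""
    return resample_poly(x, up=4, down=5)  # 320-sample frame -> 256 samples
```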

The input frame is then supplied to the optional pre-processing module 202. Pre-processing module 202 may consist of a high-pass filter with a 50 Hz cut-off frequency. High-pass filter 202 removes the unwanted sound components below 50 Hz.

The down-sampled, pre-processed signal is denoted by s_(p)(n), n=0, 1, 2, . . . , L−1, where L is the length of the frame (256 at a sampling frequency of 12.8 kHz). In an illustrative embodiment of the preemphasis filter 203, the signal s_(p)(n) is preemphasized using a filter having the following transfer function:

P(z)=1−μz⁻¹

where μ is a preemphasis factor with a value located between 0 and 1 (a typical value is μ=0.7). The function of the preemphasis filter 203 is to enhance the high frequency contents of the input speech signal. It also reduces the dynamic range of the input speech signal, which renders it more suitable for fixed-point implementation. Preemphasis also plays an important role in achieving a proper overall perceptual weighting of the quantization error, which contributes to improved sound quality. This will be explained in more detail herein below.
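A minimal sketch of this filtering operation is given below; the carrying of the filter memory across frame boundaries is omitted for brevity:

```python
import numpy as np

def preemphasize(s_p: np.ndarray, mu: float = 0.7) -> np.ndarray:
    """Apply P(z) = 1 - mu*z^-1 to one frame; the last sample of the
    previous frame would normally seed s_p[-1] (omitted in this sketch)."""
    s = np.empty_like(s_p)
    s[0] = s_p[0]
    s[1:] = s_p[1:] - mu * s_p[:-1]
    return s
```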

The output of the preemphasis filter 203 is denoted s(n). This signal is used for performing LP analysis in module 204. LP analysis is a technique well known to those of ordinary skill in the art. In this illustrative implementation, the autocorrelation approach is used. In the autocorrelation approach, the signal s(n) is first windowed using, typically, a Hamming window having a length of the order of 30-40 ms. The autocorrelations are computed from the windowed signal, and the Levinson-Durbin recursion is used to compute the LP filter coefficients a_(i), where i=1, . . . , p, and where p is the LP order, which is typically 16 in wideband coding. The parameters a_(i) are the coefficients of the transfer function A(z) of the LP filter, which is given by the following relation:

$A(z) = 1 + \sum_{i = 1}^{p} a_{i} z^{-i}$

LP analysis is performed in module 204, which also performs the quantization and interpolation of the LP filter coefficients. The LP filter coefficients are first transformed into another equivalent domain more suitable for quantization and interpolation purposes. The line spectral pair (LSP) and immittance spectral pair (ISP) domains are two domains in which quantization and interpolation can be efficiently performed. The 16 LP filter coefficients a_(i) can be quantized in the order of 30 to 50 bits using split or multi-stage quantization, or a combination thereof. The purpose of the interpolation is to enable updating the LP filter coefficients every subframe while transmitting them once every frame, which improves the encoder performance without increasing the bit rate. Quantization and interpolation of the LP filter coefficients is believed to be otherwise well known to those of ordinary skill in the art and, accordingly, will not be further described in the present specification.
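The sketch below illustrates the autocorrelation approach with the Levinson-Durbin recursion, written for the A(z) = 1 + Σa_i z⁻ⁱ convention used above. The plain Hamming window and the absence of lag windowing are simplifying assumptions and do not match the exact AMR-WB analysis:

```python
import numpy as np

def lp_analysis(s: np.ndarray, p: int = 16):
    """Autocorrelation method + Levinson-Durbin; a generic sketch only."""
    sw = s * np.hamming(len(s))                     # ~30-40 ms analysis window
    r = np.array([sw[:len(sw) - k] @ sw[k:] for k in range(p + 1)])
    r[0] = max(r[0], 1e-8)                          # guard against silence
    a = np.zeros(p + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, p + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err  # reflection coefficient
        a[1:i] += k * a[i - 1:0:-1]                 # update a_1 .. a_{i-1}
        a[i] = k
        err *= 1.0 - k * k                          # residual energy
    return a, err                                   # A(z) = 1 + sum a_i z^-i
```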

The following paragraphs will describe the rest of the coding operations performed on a subframe basis. In this illustrative implementation, the input frame is divided into 4 subframes of 5 ms (64 samples at the sampling frequency of 12.8 kHz). In the following description, the filter A(z) denotes the unquantized interpolated LP filter of the subframe, and the filter Â(z) denotes the quantized interpolated LP filter of the subframe. The filter Â(z) is supplied every subframe to a multiplexer 213 for transmission through a communication channel.

In analysis-by-synthesis encoders, the optimum pitch and innovation parameters are searched by minimizing the mean squared error between the input speech signal 212 and a synthesized speech signal in a perceptually weighted domain. The weighted signal s_(w)(n) is computed in a perceptual weighting filter 205 in response to the signal s(n) from the pre-emphasis filter 203. A perceptual weighting filter 205 with fixed denominator, suited for wideband signals, is used. An example of transfer function for the perceptual weighting filter 205 is given by the following relation:

W(z)=A(z/γ₁)/(1−γ₂z⁻¹) where 0<γ₂<γ₁≦1

In order to simplify the pitch analysis, an open-loop pitch lag T_(OL) is first estimated in an open-loop pitch search module 206 from the weighted speech signal s_(w)(n). Then the closed-loop pitch analysis, which is performed in a closed-loop pitch search module 207 on a subframe basis, is restricted around the open-loop pitch lag T_(OL), which significantly reduces the search complexity of the LTP parameters T (pitch lag) and b (pitch gain). The open-loop pitch analysis is usually performed in module 206 once every 10 ms (two subframes) using techniques well known to those of ordinary skill in the art.

The target vector x for LTP (Long Term Prediction) analysis is first computed. This is usually done by subtracting the zero-input response s₀ of the weighted synthesis filter W(z)/Â(z) from the weighted speech signal s_(w)(n). This zero-input response s₀ is calculated by a zero-input response calculator 208 in response to the quantized interpolated LP filter Â(z) from the LP analysis, quantization and interpolation module 204 and to the initial states of the weighted synthesis filter W(z)/Â(z) stored in memory update module 211 in response to the LP filters A(z) and Â(z), and the excitation vector u. This operation is well known to those of ordinary skill in the art and, accordingly, will not be further described.

An N-dimensional impulse response vector h of the weighted synthesis filter W(z)/Â(z) is computed in the impulse response generator 209 using the coefficients of the LP filters A(z) and Â(z) from module 204. Again, this operation is well known to those of ordinary skill in the art and, accordingly, will not be further described in the present specification.

The closed-loop pitch (or pitch codebook) parameters b, T and j are computed in the closed-loop pitch search module 207, which uses the target vector x, the impulse response vector h and the open-loop pitch lag T_(OL) as inputs.

The pitch search consists of finding the best pitch lag T and gain b that minimize a mean squared weighted pitch prediction error between the target vector x and a scaled filtered version of the past excitation, for example

e^((j))=∥x−b^((j))y^((j))∥² where j=1, 2, . . . , k

More specifically, in the present illustrative implementation, the pitch (pitch codebook) search is composed of three stages.

In the first stage, an open-loop pitch lag T_(OL) is estimated in the open-loop pitch search module 206 in response to the weighted speech signal s_(w)(n). As indicated in the foregoing description, this open-loop pitch analysis is usually performed once every 10 ms (two subframes) using techniques well known to those of ordinary skill in the art.

In the second stage, a search criterion C is searched in the closed-loop pitch search module 207 for integer pitch lags around the estimated open-loop pitch lag T_(OL) (usually ±5), which significantly simplifies the search procedure. A simple procedure is used for updating the filtered codevector y_(T) (this vector is defined in the following description) without the need to compute the convolution for every pitch lag. An example of search criterion C is given by:

$C = \frac{x^{t} y_{T}}{\sqrt{y_{T}^{t} y_{T}}}$

where t denotes vector transpose.

Once an optimum integer pitch lag is found in the second stage, a third stage of the search (module 207) tests, by means of the search criterion C, the fractions around that optimum integer pitch lag. For example, the AMR-WB standard uses ¼ and ½ subsample resolution.
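The following sketch shows the second-stage integer-lag search driven by the criterion C. It is deliberately simplified: fractional lags, the incremental update of y_T, and lags shorter than the subframe are not handled:

```python
import numpy as np

def integer_pitch_search(x, u_past, h, t_ol, radius=5):
    """Maximize C = x^t y_T / sqrt(y_T^t y_T) over lags T_OL +/- radius.
    u_past holds the past excitation (most recent sample last); the sketch
    assumes every tested lag T is at least the subframe length N."""
    N = len(x)
    best_T, best_C = t_ol, -np.inf
    for T in range(t_ol - radius, t_ol + radius + 1):
        v = u_past[len(u_past) - T: len(u_past) - T + N]  # v_T(n) = u(n - T)
        y = np.convolve(v, h)[:N]                         # filtered codevector
        c = (x @ y) / np.sqrt(y @ y + 1e-12)
        if c > best_C:
            best_T, best_C = T, c
    return best_T
```

The corresponding pitch gain would then be b = xᵗy_T / (y_Tᵗy_T), computed for the winning lag.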

In wideband signals, the harmonic structure exists only up to a certain frequency, depending on the speech segment. Thus, in order to achieve an efficient representation of the pitch contribution in voiced segments of a wideband speech signal, flexibility is needed to vary the amount of periodicity over the wideband spectrum. This is achieved by processing the pitch codevector through a plurality of frequency shaping filters (for example low-pass or band-pass filters), and the frequency shaping filter that minimizes the mean-squared weighted error e^((j)) is selected. The selected frequency shaping filter is identified by an index j.

The pitch codebook index T is encoded and transmitted to the multiplexer 213 for transmission through a communication channel. The pitch gain b is quantized and transmitted to the multiplexer 213. An extra bit is used to encode the index j, this extra bit being also supplied to the multiplexer 213.

Once the pitch, or LTP (Long Term Prediction), parameters b, T, and j are determined, the next step is to search for the optimum innovative excitation by means of the innovative excitation search module 210 of FIG. 2. First, the target vector x is updated by subtracting the LTP contribution:

x′=x−by_(T)

where b is the pitch gain and y_(T) is the filtered pitch codebook vector (the past excitation at delay T filtered with the selected frequency shaping filter of index j and convolved with the impulse response h).

The innovative excitation search procedure in CELP is performed in an innovation codebook to find the optimum excitation codevector c_(k) and gain g which minimize the mean-squared error E between the target vector x′ and a scaled filtered version of the codevector c_(k), for example:

E=∥x′−gHc_(k)∥²

where H is a lower triangular convolution matrix derived from the impulse response vector h. The index k of the innovation codebook corresponding to the found optimum codevector c_(k) and the gain g are supplied to the multiplexer 213 for transmission through a communication channel.
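The sketch below shows the underlying minimization for explicitly enumerated candidate codevectors. A real algebraic codebook is never searched this way (fast pulse-position techniques are used instead), so this is an illustration of the criterion only, with `candidates` as a hypothetical list of codevectors:

```python
import numpy as np

def innovation_search(x_prime, candidates, h):
    """Pick c_k and g minimizing E = ||x' - g*H*c_k||^2 by brute force;
    a sketch of the search criterion, not the algebraic-codebook search."""
    best_k, best_g, best_E = -1, 0.0, np.inf
    for k, c in enumerate(candidates):
        z = np.convolve(c, h)[:len(x_prime)]        # H c_k
        g = (x_prime @ z) / (z @ z + 1e-12)         # optimal gain for this c_k
        E = x_prime @ x_prime - g * (x_prime @ z)   # resulting error energy
        if E < best_E:
            best_k, best_g, best_E = k, g, E
    return best_k, best_g
```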

It should be noted that the used innovation codebook is a dynamic codebook consisting of an algebraic codebook followed by an adaptive prefilter F(z) which enhances special spectral components in order to improve the synthesis speech quality, according to U.S. Pat. No. 5,444,816 granted to Adoul et al. on Aug. 22, 1995. In this illustrative implementation, the innovative codebook search is performed in module 210 by means of an algebraic codebook as described in U.S. Pat. No. 5,444,816 (Adoul et al.) issued on Aug. 22, 1995; U.S. Pat. No. 5,699,482 granted to Adoul et al. on Dec. 17, 1997; U.S. Pat. No. 5,754,976 granted to Adoul et al. on May 19, 1998; and U.S. Pat. No. 5,701,392 (Adoul et al.) dated Dec. 23, 1997.

Overview of the AMR-WB Decoder

The speech decoder 300 of FIG. 3 illustrates the various steps carried out between the digital input 322 (input bit stream to the demultiplexer 317) and the output sampled speech signal 323 (output of the adder 321).

Demultiplexer 317 extracts the synthesis model parameters from the binary information (input bit stream 322) received from a digital input channel. From each received binary frame, the extracted parameters are:

-   the quantized, interpolated LP coefficients Â(z), also called short-term prediction (STP) parameters, produced once per frame;
-   the long-term prediction (LTP) parameters T, b, and j (for each subframe); and
-   the innovation codebook index k and gain g (for each subframe).

The current speech signal is synthesized based on these parameters as will be explained hereinbelow.

The innovation codebook 318 is responsive to the index k to produce the innovation codevector c_(k), which is scaled by the decoded gain factor g through an amplifier 324. In the illustrative implementation, an innovation codebook as described in the above mentioned U.S. Pat. Nos. 5,444,816; 5,699,482; 5,754,976; and 5,701,392 is used to produce the innovative codevector c_(k).

The generated scaled codevector at the output of the amplifier 324 is processed through a frequency-dependent pitch enhancer 305.

Enhancing the periodicity of the excitation signal u improves the quality of voiced segments. The periodicity enhancement is achieved by filtering the innovative codevector c_(k) from the innovation (fixed) codebook through an innovation filter F(z) (pitch enhancer 305) whose frequency response emphasizes the higher frequencies more than the lower frequencies. The coefficients of the innovation filter F(z) are related to the amount of periodicity in the excitation signal u.

An efficient, illustrative way to derive the coefficients of the innovation filter F(z) is to relate them to the amount of pitch contribution in the total excitation signal u. This results in a frequency response depending on the subframe periodicity, where higher frequencies are more strongly emphasized (stronger overall slope) for higher pitch gains. The innovation filter 305 has the effect of lowering the energy of the innovation codevector c_(k) at lower frequencies when the excitation signal u is more periodic, which enhances the periodicity of the excitation signal u at lower frequencies more than at higher frequencies. A suggested form for the innovation filter 305 is the following:

F(z)=−αz+1−αz⁻¹

where α is a periodicity factor derived from the level of periodicity of the excitation signal u. The periodicity factor α is computed in the voicing factor generator 304. First, a voicing factor r_(v) is computed in voicing factor generator 304 by:

r_(v)=(E_(v)−E_(c))/(E_(v)+E_(c))

where E_(v) is the energy of the scaled pitch codevector bv_(T) and E_(c) is the energy of the scaled innovative codevector gc_(k). That is:

$E_{v} = b^{2} v_{T}^{t} v_{T} = b^{2} \sum_{n = 0}^{N - 1} v_{T}^{2}(n) \qquad \text{and} \qquad E_{c} = g^{2} c_{k}^{t} c_{k} = g^{2} \sum_{n = 0}^{N - 1} c_{k}^{2}(n)$

Note that the value of r_(v) lies between −1 and 1 (1 corresponds to purely voiced signals and −1 corresponds to purely unvoiced signals).

The above mentioned scaled pitch codevector bv_(T) is produced by applying the pitch delay T to a pitch codebook 301 to produce a pitch codevector. The pitch codevector is then processed through a low-pass filter 302, whose cut-off frequency is selected in relation to index j from the demultiplexer 317, to produce the filtered pitch codevector v_(T). The filtered pitch codevector v_(T) is then amplified by the pitch gain b through an amplifier 326 to produce the scaled pitch codevector bv_(T).

In this illustrative implementation, the factor α is then computed in voicing factor generator 304 by:

α=0.125(1+r_(v))

which corresponds to a value of 0 for purely unvoiced signals and 0.25 for purely voiced signals.

The enhanced signal c_(f) is therefore computed by filtering the scaled innovative codevector gc_(k) through the innovation filter 305 (F(z)).

The enhanced excitation signal u′ is computed by the adder 320 as:

u′=c_(f)+bv_(T)
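A compact sketch of this decoder-side enhancement chain (voicing factor, periodicity factor and innovation filtering) follows; zero-padding the FIR filter at the frame boundaries is an assumption of this sketch:

```python
import numpy as np

def enhance_excitation(g_ck: np.ndarray, b_vT: np.ndarray) -> np.ndarray:
    """Compute r_v and alpha, filter the scaled innovation through
    F(z) = -alpha*z + 1 - alpha*z^-1, and return u' = c_f + b*v_T."""
    E_v = b_vT @ b_vT                        # energy of scaled pitch codevector
    E_c = g_ck @ g_ck                        # energy of scaled innovation
    r_v = (E_v - E_c) / (E_v + E_c + 1e-12)  # voicing factor in [-1, 1]
    alpha = 0.125 * (1.0 + r_v)              # 0 (unvoiced) .. 0.25 (voiced)
    c = np.pad(g_ck, 1)                      # zeros for the z and z^-1 taps
    c_f = -alpha * c[2:] + c[1:-1] - alpha * c[:-2]
    return c_f + b_vT                        # enhanced excitation u'
```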

It should be noted that this process is not performed at the encoder 200. Thus, it is essential to update the content of the pitch codebook 301 using the past value of the excitation signal u without enhancement, stored in memory 303, to keep synchronism between the encoder 200 and decoder 300. Therefore, the excitation signal u is used to update the memory 303 of the pitch codebook 301, and the enhanced excitation signal u′ is used at the input of the LP synthesis filter 306.

The synthesized signal s′ is computed by filtering the enhanced excitation signal u′ through the LP synthesis filter 306, which has the form 1/Â(z), where Â(z) is the quantized, interpolated LP filter in the current subframe. As can be seen in FIG. 3, the quantized, interpolated LP coefficients Â(z) on line 325 from the demultiplexer 317 are supplied to the LP synthesis filter 306 to adjust the parameters of the LP synthesis filter 306 accordingly. The deemphasis filter 307 is the inverse of the preemphasis filter 203 of FIG. 2. The transfer function of the deemphasis filter 307 is given by

D(z)=1/(1−μz⁻¹)

where μ is a preemphasis factor with a value located between 0 and 1 (a typical value is μ=0.7). A higher-order filter could also be used.
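Since D(z) is a first-order IIR filter, it can be sketched directly with a standard filtering routine; the state vector zi carries the filter memory between frames:

```python
from scipy.signal import lfilter

def deemphasize(s_prime, mu=0.7, zi=None):
    """Apply D(z) = 1/(1 - mu*z^-1), the inverse of the preemphasis;
    a minimal sketch returning the output and the updated filter state."""
    if zi is None:
        zi = [0.0]
    s_d, zf = lfilter([1.0], [1.0, -mu], s_prime, zi=zi)
    return s_d, zf
```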

The vector s′ is filtered through the deemphasis filter D(z) 307 to obtain the vector s_(d), which is processed through the high-pass filter 308 to remove the unwanted frequencies below 50 Hz and further obtain s_(h).

The oversampler 309 conducts the inverse process of the downsampler 201 of FIG. 2. In this illustrative embodiment, over-sampling converts the 12.8 kHz sampling rate back to the original 16 kHz sampling rate, using techniques well known to those of ordinary skill in the art. The oversampled synthesis signal is denoted ŝ. Signal ŝ is also referred to as the synthesized wideband intermediate signal.

The oversampled synthesis signal ŝ does not contain the higher frequency components which were lost during the downsampling process (module 201 of FIG. 2) at the encoder 200. This gives a low-pass perception to the synthesized speech signal. To restore the full band of the original signal, a high frequency generation procedure is performed in module 310 and requires input from the voicing factor generator 304 (FIG. 3).

The resulting band-pass filtered noise sequence z from the high frequency generation module 310 is added by the adder 321 to the oversampled synthesized speech signal ŝ to obtain the final reconstructed output speech signal s_(out) on the output 323. An example of high frequency regeneration process is described in International PCT patent application published under No. WO 00/25305 on May 4, 2000.

The bit allocation of the AMR-WB codec at 12.65 kbit/s is given in Table 1.

TABLE 1
Bit allocation in the 12.65-kbit/s mode

  Parameter            Bits/Frame
  ------------------   -------------------------
  LP Parameters        46
  Pitch Delay          30 = 9 + 6 + 9 + 6
  Pitch Filtering      4 = 1 + 1 + 1 + 1
  Gains                28 = 7 + 7 + 7 + 7
  Algebraic Codebook   144 = 36 + 36 + 36 + 36
  Mode Bit             1
  Total                253 bits = 12.65 kbit/s

Robust Frame Erasure Concealment

The erasure of frames has a major effect on the synthesized speech quality in digital speech communication systems, especially when operating in wireless environments and packet-switched networks. In wireless cellular systems, the energy of the received signal can exhibit frequent severe fades resulting in high bit error rates, and this becomes more evident at the cell boundaries. In this case the channel decoder fails to correct the errors in the received frame and, as a consequence, the error detector usually used after the channel decoder will declare the frame as erased. In voice over packet network applications, such as Voice over Internet Protocol (VoIP), the speech signal is packetized, usually with a 20 ms frame placed in each packet. In packet-switched communications, a packet dropping can occur at a router if the number of packets becomes very large, or the packet can arrive at the receiver after a long delay and it should be declared as lost if its delay is more than the length of a jitter buffer at the receiver side. In these systems, the codec is subjected to typically 3 to 5% frame erasure rates.

The problem of frame erasure (FER) processing is basically twofold. First, when an erased frame indicator arrives, the missing frame must be generated by using the information sent in the previous frame and by estimating the signal evolution in the missing frame. The success of the estimation depends not only on the concealment strategy, but also on the place in the speech signal where the erasure happens. Secondly, a smooth transition must be assured when normal operation recovers, i.e. when the first good frame arrives after a block of erased frames (one or more). This is not a trivial task as the true synthesis and the estimated synthesis can evolve differently. When the first good frame arrives, the decoder is hence desynchronized from the encoder. The main reason is that low bit rate encoders rely on pitch prediction, and during erased frames, the memory of the pitch predictor is no longer the same as the one at the encoder. The problem is amplified when many consecutive frames are erased. As for the concealment, the difficulty of the normal processing recovery depends on the type of speech signal where the erasure occurred.

The negative effect of frame erasures can be significantly reduced by adapting the concealment and the recovery of normal processing (further recovery) to the type of the speech signal where the erasure occurs. For this purpose, it is necessary to classify each speech frame. This classification can be done at the encoder and transmitted. Alternatively, it can be estimated at the decoder.

For the best concealment and recovery, there are a few critical characteristics of the speech signal that must be carefully controlled. These critical characteristics are the signal energy or amplitude, the amount of periodicity, the spectral envelope and the pitch period. In case of a voiced speech recovery, further improvement can be achieved by phase control. With a slight increase in the bit rate, a few supplementary parameters can be quantized and transmitted for better control. If no additional bandwidth is available, the parameters can be estimated at the decoder. With these parameters controlled, the frame erasure concealment and recovery can be significantly improved, especially by improving the convergence of the decoded signal to the actual signal at the encoder and alleviating the effect of mismatch between the encoder and decoder when normal processing recovers.

In the present illustrative embodiment of the present invention, methods for efficient frame erasure concealment are disclosed, along with methods for extracting and transmitting parameters that will improve the performance and convergence at the decoder in the frames following an erased frame. These parameters include two or more of the following: frame classification, energy, voicing information, and phase information. Further, methods are disclosed for extracting such parameters at the decoder if transmission of extra bits is not possible. Finally, methods for improving the decoder convergence in good frames following an erased frame are also disclosed.

The frame erasure concealment techniques according to the present illustrative embodiment have been applied to the AMR-WB codec described above. This codec will serve as an example framework for the implementation of the FER concealment methods in the following description. As explained above, the input speech signal 212 to the codec has a 16 kHz sampling frequency, but it is downsampled to a 12.8 kHz sampling frequency before further processing. In the present illustrative embodiment, FER processing is done on the downsampled signal.

FIG. 4 gives a simplified block diagram of the AMR-WB encoder 400. In this simplified block diagram, the downsampler 201, high-pass filter 202 and preemphasis filter 203 are grouped together in the preprocessing module 401. Also, the closed-loop search module 207, the zero-input response calculator 208, the impulse response calculator 209, the innovative excitation search module 210, and the memory update module 211 are grouped in a closed-loop pitch and innovation codebook search module 402. This grouping is done to simplify the introduction of the new modules related to the illustrative embodiment of the present invention.

FIG. 5 is an extension of the block diagram of FIG. 4 where the modules related to the illustrative embodiment of the present invention are added. In these added modules 500 to 507, additional parameters are computed, quantized, and transmitted with the aim of improving the FER concealment and the convergence and recovery of the decoder after erased frames. In the present illustrative embodiment, these parameters include signal classification, energy, and phase information (the estimated position of the first glottal pulse in a frame).

In the next sections, the computation and quantization of these additional parameters will be given in detail and will become more apparent with reference to FIG. 5. Among these parameters, signal classification will be treated in more detail. In the subsequent sections, efficient FER concealment using these additional parameters to improve the convergence will be explained.

Signal Classification for FER Concealment and Recovery

The basic idea behind using a classification of the speech for signal reconstruction in the presence of erased frames is that the ideal concealment strategy is different for quasi-stationary speech segments and for speech segments with rapidly changing characteristics. While the best processing of erased frames in non-stationary speech segments can be summarized as a rapid convergence of speech-encoding parameters to the ambient noise characteristics, in the case of a quasi-stationary signal, the speech-encoding parameters do not vary dramatically and can be kept practically unchanged during several adjacent erased frames before being damped. Also, the optimal method for signal recovery following an erased block of frames varies with the classification of the speech signal.

The speech signal can be roughly classified as voiced, unvoiced and pauses. Voiced speech contains an important amount of periodic components and can be further divided into the following categories: voiced onsets, voiced segments, voiced transitions and voiced offsets. A voiced onset is defined as a beginning of a voiced speech segment after a pause or an unvoiced segment. During voiced segments, the speech signal parameters (spectral envelope, pitch period, ratio of periodic and non-periodic components, energy) vary slowly from frame to frame. A voiced transition is characterized by rapid variations of voiced speech, such as a transition between vowels. Voiced offsets are characterized by a gradual decrease of energy and voicing at the end of voiced segments.

The unvoiced parts of the signal are characterized by the absence of a periodic component and can be further divided into unstable frames, where the energy and the spectrum change rapidly, and stable frames, where these characteristics remain relatively stable. Remaining frames are classified as silence. Silence frames comprise all frames without active speech, i.e. also noise-only frames if background noise is present.

Not all of the above mentioned classes need separate processing. Hence, for the purposes of error concealment techniques, some of the signal classes are grouped together.

Classification at the Encoder

When there is available bandwidth in the bitstream to include the classification information, the classification can be done at the encoder. This has several advantages. The most important is that there is often a look-ahead in speech encoders. The look-ahead makes it possible to estimate the evolution of the signal in the following frame, and consequently the classification can be done by taking into account the future signal behavior. Generally, the longer the look-ahead, the better the classification can be. A further advantage is a complexity reduction, as most of the signal processing necessary for frame erasure concealment is needed anyway for speech encoding. Finally, there is also the advantage of working with the original signal instead of the synthesized signal.

The frame classification is done with the consideration of the concealment and recovery strategy in mind. In other words, any frame is classified in such a way that the concealment can be optimal if the following frame is missing, or that the recovery can be optimal if the previous frame was lost. Some of the classes used for the FER processing need not be transmitted, as they can be deduced without ambiguity at the decoder. In the present illustrative embodiment, five (5) distinct classes are used, defined as follows:

-   UNVOICED class comprises all unvoiced speech frames and all frames without active speech. A voiced offset frame can also be classified as UNVOICED if its end tends to be unvoiced and the concealment designed for unvoiced frames can be used for the following frame in case it is lost.
-   UNVOICED TRANSITION class comprises unvoiced frames with a possible voiced onset at the end. The onset is however still too short or not built well enough to use the concealment designed for voiced frames. The UNVOICED TRANSITION class can follow only a frame classified as UNVOICED or UNVOICED TRANSITION.
-   VOICED TRANSITION class comprises voiced frames with relatively weak voiced characteristics. Those are typically voiced frames with rapidly changing characteristics (transitions between vowels) or voiced offsets lasting the whole frame. The VOICED TRANSITION class can follow only a frame classified as VOICED TRANSITION, VOICED or ONSET.
-   VOICED class comprises voiced frames with stable characteristics. This class can follow only a frame classified as VOICED TRANSITION, VOICED or ONSET.
-   ONSET class comprises all voiced frames with stable characteristics following a frame classified as UNVOICED or UNVOICED TRANSITION. Frames classified as ONSET correspond to voiced onset frames where the onset is already sufficiently well built for the use of the concealment designed for lost voiced frames. The concealment techniques used for a frame erasure following the ONSET class are the same as those following the VOICED class. The difference is in the recovery strategy. If an ONSET class frame is lost (i.e. a VOICED good frame arrives after an erasure, but the last good frame before the erasure was UNVOICED), a special technique can be used to artificially reconstruct the lost onset. This scenario can be seen in FIG. 6. The artificial onset reconstruction techniques will be described in more detail in the following description. On the other hand, if an ONSET good frame arrives after an erasure and the last good frame before the erasure was UNVOICED, this special processing is not needed, as the onset has not been lost (has not been in the lost frame).

The classification state diagram is outlined in FIG. 7. If the available bandwidth is sufficient, the classification is done in the encoder and transmitted using 2 bits. As can be seen from FIG. 7, the UNVOICED TRANSITION class and VOICED TRANSITION class can be grouped together, as they can be unambiguously differentiated at the decoder (UNVOICED TRANSITION can follow only UNVOICED or UNVOICED TRANSITION frames; VOICED TRANSITION can follow only ONSET, VOICED or VOICED TRANSITION frames). The following parameters are used for the classification: a normalized correlation r_(x), a spectral tilt measure e_(t), a signal-to-noise ratio snr, a pitch stability counter pc, a relative frame energy of the signal at the end of the current frame E_(s), and a zero-crossing counter zc. As can be seen in the following detailed analysis, the computation of these parameters uses the available look-ahead as much as possible to take into account the behavior of the speech signal also in the following frame.

The normalized correlation r_(x) is computed as part of the open-loop pitch search module 206 of FIG. 5. This module 206 usually outputs the open-loop pitch estimate every 10 ms (twice per frame). Here, it is also used to output the normalized correlation measures. These normalized correlations are computed on the current weighted speech signal s_(w)(n) and the past weighted speech signal at the open-loop pitch delay. In order to reduce the complexity, the weighted speech signal s_(w)(n) is downsampled by a factor of 2 prior to the open-loop pitch analysis, down to a sampling frequency of 6400 Hz [3GPP TS 26.190, “AMR Wideband Speech Codec: Transcoding Functions,” 3GPP Technical Specification]. The average correlation r̄_(x) is defined as

r̄_(x)=0.5(r_(x)(1)+r_(x)(2))  (1)

where r_(x)(1), r_(x)(2) are respectively the normalized correlation of the second half of the current frame and of the look-ahead. In this illustrative embodiment, a look-ahead of 13 ms is used, unlike the AMR-WB standard that uses 5 ms. The normalized correlation r_(x)(k) is computed as follows:

$r_{x}(k) = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}} \qquad (2)$

where

$r_{xy} = \sum_{i = 0}^{L_{k} - 1} x(t_{k} + i)\, x(t_{k} + i - p_{k}), \qquad r_{xx} = \sum_{i = 0}^{L_{k} - 1} x^{2}(t_{k} + i), \qquad r_{yy} = \sum_{i = 0}^{L_{k} - 1} x^{2}(t_{k} + i - p_{k})$

The correlations r_(x)(k) are computed using the weighted speech signal s_(w)(n). The instants t_(k) are related to the current frame beginning and are equal to 64 and 128 samples, respectively, at the sampling frequency of 6.4 kHz (10 and 20 ms). The values p_(k)=T_(OL) are the selected open-loop pitch estimates. The length of the autocorrelation computation L_(k) is dependent on the pitch period. The values of L_(k) are summarized below (for the sampling rate of 6.4 kHz):

-   L_(k)=40 samples for p_(k)≦31 samples
-   L_(k)=62 samples for 31<p_(k)≦61 samples
-   L_(k)=115 samples for p_(k)>61 samples

These lengths assure that the correlated vector length comprises at least one pitch period, which helps for a robust open-loop pitch detection. For long pitch periods (p_(k)>61 samples), r_(x)(1) and r_(x)(2) are identical, i.e. only one correlation is computed, since the correlated vectors are long enough that the analysis on the look-ahead is no longer necessary.
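A direct transcription of Equation (2) is sketched below; x is assumed to be the downsampled weighted speech, indexed so that x[t_k] is the first sample of the analyzed segment:

```python
import numpy as np

def normalized_correlation(x, t_k, p_k, L_k):
    """r_x(k) of Equation (2): correlation between the segment at t_k and
    the same signal one open-loop pitch period p_k earlier; sketch only."""
    a = x[t_k: t_k + L_k]
    b = x[t_k - p_k: t_k - p_k + L_k]
    return (a @ b) / np.sqrt((a @ a) * (b @ b) + 1e-12)
```

Equation (1) then averages this value at t_k = 64 and 128 (at the 6.4 kHz rate) with the corresponding p_k and L_k.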

The spectral tilt parameter e_(t) contains information about the frequency distribution of energy. In the present illustrative embodiment, the spectral tilt is estimated as a ratio between the energy concentrated in low frequencies and the energy concentrated in high frequencies. However, it can also be estimated in different ways, such as a ratio between the first two autocorrelation coefficients of the speech signal.

The discrete Fourier transform is used to perform the spectral analysis in the spectral analysis and spectrum energy estimation module 500 of FIG. 5. The frequency analysis and the tilt computation are done twice per frame. A 256-point fast Fourier transform (FFT) is used with a 50 percent overlap. The analysis windows are placed so that all the look-ahead is exploited. In this illustrative embodiment, the beginning of the first window is placed 24 samples after the beginning of the current frame. The second window is placed 128 samples further. Different windows can be used to weight the input signal for the frequency analysis. A square root of a Hamming window (which is equivalent to a sine window) has been used in the present illustrative embodiment. This window is particularly well suited for overlap-add methods. Therefore, this particular spectral analysis can be used in an optional noise suppression algorithm based on spectral subtraction and overlap-add analysis/synthesis.

The energy in high frequencies and in low frequencies is computed in module 500 of FIG. 5 following the perceptual critical bands. In the present illustrative embodiment, the critical bands are considered up to the following upper limits [J. D. Johnston, “Transform Coding of Audio Signals Using Perceptual Noise Criteria,” IEEE Journal on Selected Areas in Communications, vol. 6, no. 2, pp. 314-323]:

Critical bands = {100.0, 200.0, 300.0, 400.0, 510.0, 630.0, 770.0, 920.0, 1080.0, 1270.0, 1480.0, 1720.0, 2000.0, 2320.0, 2700.0, 3150.0, 3700.0, 4400.0, 5300.0, 6350.0} Hz.

The energy in higher frequencies is computed in module 500 as the average of the energies of the last two critical bands:

Ē_(h)=0.5(e(18)+e(19))  (3)

where the critical band energies e(i) are computed as a sum of the bin energies within the critical band, averaged by the number of bins.

The energy in lower frequencies is computed as the average of the energies in the first 10 critical bands. The middle critical bands have been excluded from the computation to improve the discrimination between frames with high energy concentration in low frequencies (generally voiced) and frames with high energy concentration in high frequencies (generally unvoiced). In between, the energy content is not characteristic of any of the classes and would increase the decision confusion.
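The per-band energies e(i) and the averages Ē_(h) and Ē_(l) can be sketched as follows. The sine window and the simple bin-to-band mapping are assumptions of this sketch, not the exact standardized computation:

```python
import numpy as np

# Upper edges of the 20 critical bands in Hz, from the list above.
BAND_EDGES = [100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
              1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6350]

def critical_band_energies(frame, fs=12800, nfft=256):
    """e(i): mean FFT-bin energy per critical band (sketch)."""
    w = np.sin(np.pi * (np.arange(nfft) + 0.5) / nfft)  # sine window
    spec = np.fft.rfft(frame[:nfft] * w)
    bin_hz = fs / nfft                                  # 50 Hz per bin
    e, lo = [], 1                                       # skip the DC bin
    for edge in BAND_EDGES:
        hi = int(round(edge / bin_hz))
        e.append(np.mean(np.abs(spec[lo:hi + 1]) ** 2))
        lo = hi + 1
    return np.array(e)

# E_h of Eq. (3) and the per-band E_l of Eq. (5) then follow as:
# e = critical_band_energies(frame); E_h = 0.5*(e[18]+e[19]); E_l = e[:10].mean()
```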

In module 500, the energy in low frequencies is computed differently for long pitch periods and short pitch periods. For voiced female speech segments, the harmonic structure of the spectrum can be exploited to increase the voiced-unvoiced discrimination. Thus, for short pitch periods, Ē_(l) is computed bin-wise and only frequency bins sufficiently close to the speech harmonics are taken into account in the summation, i.e.

$\bar{E}_{l} = \frac{1}{cnt} \sum_{i = 0}^{24} e_{b}(i) \qquad (4)$

where e_(b)(i) are the bin energies in the first 25 frequency bins (the DC component is not considered). Note that these 25 bins correspond to the first 10 critical bands. In the above summation, only the terms related to bins closer to the nearest harmonic than a certain frequency threshold are non-zero. The counter cnt equals the number of those non-zero terms. The threshold for a bin to be included in the sum has been fixed to 50 Hz, i.e. only bins closer than 50 Hz to the nearest harmonic are taken into account. Hence, if the structure is harmonic in low frequencies, only high-energy terms will be included in the sum. On the other hand, if the structure is not harmonic, the selection of the terms will be random and the sum will be smaller. Thus even unvoiced sounds with high energy content in low frequencies can be detected. This processing cannot be done for longer pitch periods, as the frequency resolution is not sufficient. The threshold pitch value is 128 samples, corresponding to 100 Hz. This means that for pitch periods longer than 128 samples, and also for a priori unvoiced sounds (i.e. when r̄_(x)+r_(e)<0.6), the low frequency energy estimation is done per critical band and is computed as

$\bar{E}_{l} = \frac{1}{10} \sum_{i = 0}^{9} e(i) \qquad (5)$
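The bin-wise selection of Equation (4) can be sketched as below; the 50 Hz bin spacing follows from the 256-point FFT at 12.8 kHz, and the harmonic grid is assumed to be exact multiples of the pitch frequency:

```python
def low_freq_energy_harmonic(e_b, pitch_hz, bin_hz=50.0, thresh_hz=50.0):
    """Bin-wise E_l of Equation (4): average only the bins among the first
    25 (DC excluded) lying within thresh_hz of a pitch harmonic; sketch."""
    total, cnt = 0.0, 0
    for i, energy in enumerate(e_b[:25]):
        f = (i + 1) * bin_hz                      # bin centre frequency
        nearest = round(f / pitch_hz) * pitch_hz  # closest harmonic
        if nearest > 0 and abs(f - nearest) < thresh_hz:
            total += energy
            cnt += 1
    return total / cnt if cnt else 0.0
```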

The value r_(e), calculated in a noise estimation and normalized correlation correction module 501, is a correction added to the normalized correlation in the presence of background noise, for the following reason. In the presence of background noise, the average normalized correlation decreases. However, for the purpose of signal classification, this decrease should not affect the voiced-unvoiced decision. It has been found that the dependence between this decrease r_(e) and the total background noise energy in dB is approximately exponential and can be expressed using the following relationship:

r_(e)=2.4492·10⁻⁴·e^(0.1596·N_(dB))−0.022

where N_(dB) stands for

$N_{dB} = 10 \log_{10}\left( \frac{1}{20} \sum_{i = 0}^{19} n(i) \right) - g_{dB}$

Here, n(i) are the noise energy estimates for each critical band, normalized in the same way as e(i), and g_(dB) is the maximum noise suppression level in dB allowed for the noise reduction routine. The value r_(e) is not allowed to be negative. It should be noted that when a good noise reduction algorithm is used and g_(dB) is sufficiently high, r_(e) is practically equal to zero. It is only relevant when the noise reduction is disabled or if the background noise level is significantly higher than the maximum allowed reduction. The influence of r_(e) can be tuned by multiplying this term by a constant.

Finally, the resulting lower and higher frequency energies are obtained by subtracting an estimated noise energy from the values Ē_(h) and Ē_(l) calculated above. That is

E_(h)=Ē_(h)−f_(c)·N_(h)  (6)

E_(l)=Ē_(l)−f_(c)·N_(l)  (7)

where N_(h) and N_(l) are the averaged noise energies in the last two (2) critical bands and the first ten (10) critical bands, respectively, computed using equations similar to Equations (3) and (5), and f_(c) is a correction factor tuned so that these measures remain nearly constant as the background noise level varies. In this illustrative embodiment, the value of f_(c) has been fixed to 3.

The spectral tilt e_(t) is calculated in the spectral tilt estimation module 503 using the relation:

$e_{t} = \frac{E_{l}}{E_{h}} \qquad (8)$

and it is averaged in the dB domain for the two (2) frequency analyses performed per frame:

ē_(t)=10·log₁₀(e_(t)(0)·e_(t)(1))

The signal-to-noise ratio (SNR) measure exploits the fact that, for a general waveform matching encoder, the SNR is much higher for voiced sounds. The snr parameter estimation must be done at the end of the encoder subframe loop and is computed in the SNR computation module 504 using the relation:

$\begin{matrix}{{snr} = \frac{E_{sw}}{E_{e}}} & (9)\end{matrix}$

where E_sw is the energy of the weighted speech signal s_w(n) of the current frame from the perceptual weighting filter 205, and E_e is the energy of the error between this weighted speech signal and the weighted synthesis signal of the current frame from the perceptual weighting filter 205′.

The pitch stability counter pc assesses the variation of the pitch period. It is computed within the signal classification module 505 in response to the open-loop pitch estimates as follows:

$\begin{matrix}{pc = {{\left| {p_{1} - p_{0}} \right|} + {\left| {p_{2} - p_{1}} \right|}}} & (10)\end{matrix}$

The values p₀, p₁ and p₂ correspond to the open-loop pitch estimates calculated by the open-loop pitch search module 206 from the first half of the current frame, the second half of the current frame and the look-ahead, respectively.

The relative frame energy E_s is computed by module 500 as the difference between the current frame energy in dB and its long-term average:

$E_{s} = {{\overset{\_}{E}}_{f} - E_{lt}}$

where the frame energy Ē_f is obtained as a summation of the critical band energies, averaged over the two spectral analyses performed each frame:

${\overset{\_}{E}}_{f} = {10 \cdot \log_{10}\left( {0.5\left( {E_{f}(0) + E_{f}(1)} \right)} \right)}$

${E_{f}(j)} = {\sum\limits_{i = 0}^{19}{e(i)}}$

where e(i) are the critical band energies of the j-th spectral analysis. The long-term averaged energy is updated on active speech frames using the following relation:

$E_{lt} = {{0.99 \cdot E_{lt}} + {0.01 \cdot {\overset{\_}{E}}_{f}}}$

The last parameter is the zero-crossing parameter zc, computed on one frame of the speech signal by the zero-crossing computation module 508. The window considered starts in the middle of the current frame and uses two (2) subframes of the look-ahead. In this illustrative embodiment, the zero-crossing counter zc counts the number of times the signal sign changes from positive to negative during that interval.

To make the classification more robust, the classification parameters are considered together, forming a function of merit f_m. For that purpose, the classification parameters are first scaled between 0 and 1 so that each parameter value typical of an unvoiced signal translates into 0 and each parameter value typical of a voiced signal translates into 1. A linear function is used between them. For a parameter p_x, the scaled version is obtained using:

$p^{s} = {{k_{p} \cdot p_{x}} + c_{p}}$

and clipped between 0 and 1. The function coefficients k_p and c_p have been found experimentally for each of the parameters so that the signal distortion due to the concealment and recovery techniques used in the presence of FERs is minimal. The values used in this illustrative implementation are summarized in Table 2:

TABLE 2
Signal classification parameters and the coefficients of their respective scaling functions

Parameter   Meaning                    k_p        c_p
r̄_x         Normalized correlation     2.857      −1.286
ē_t         Spectral tilt              0.04167    0
snr         Signal-to-noise ratio      0.1111     −0.3333
pc          Pitch stability counter    −0.07143   1.857
E_s         Relative frame energy      0.05       0.45
zc          Zero-crossing counter      −0.04      2.4

The merit function has been defined as:

$f_{m} = {\frac{1}{7}\left( {{2 \cdot {\overset{\_}{r}}_{x}^{s}} + {\overset{\_}{e}}_{t}^{s} + {snr}^{s} + {pc}^{s} + E_{s}^{s} + {zc}^{s}} \right)}$

where the superscript s indicates the scaled version of the parameters.

The classification is then done using the merit function f_m, following the rules summarized in Table 3:

TABLE 3
Signal classification rules at the encoder

Previous frame class                Rule                   Current frame class
ONSET, VOICED, VOICED TRANSITION    f_m ≧ 0.66             VOICED
                                    0.66 > f_m ≧ 0.49      VOICED TRANSITION
                                    f_m < 0.49             UNVOICED
UNVOICED TRANSITION, UNVOICED       f_m > 0.63             ONSET
                                    0.63 ≧ f_m > 0.585     UNVOICED TRANSITION
                                    f_m ≦ 0.585            UNVOICED
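The scaling of Table 2 and the decision rules of Table 3 translate directly into code. The following Python sketch is illustrative only; the dictionary keys and function names are assumptions, not part of the specification.

```python
# Scaling coefficients from Table 2: parameter -> (k_p, c_p)
SCALE = {
    "rx": (2.857, -1.286),     # normalized correlation
    "et": (0.04167, 0.0),      # spectral tilt
    "snr": (0.1111, -0.3333),  # signal-to-noise ratio
    "pc": (-0.07143, 1.857),   # pitch stability counter
    "Es": (0.05, 0.45),        # relative frame energy
    "zc": (-0.04, 2.4),        # zero-crossing counter
}

def scaled(name, value):
    """Linear scaling p_s = k_p * p_x + c_p, clipped to [0, 1]."""
    k, c = SCALE[name]
    return min(1.0, max(0.0, k * value + c))

def merit(params):
    """Merit function f_m with double weight on the normalized correlation."""
    s = {n: scaled(n, v) for n, v in params.items()}
    return (2.0 * s["rx"] + s["et"] + s["snr"] + s["pc"] + s["Es"] + s["zc"]) / 7.0

def classify(f_m, prev_class):
    """Decision rules of Table 3."""
    if prev_class in ("ONSET", "VOICED", "VOICED TRANSITION"):
        if f_m >= 0.66:
            return "VOICED"
        if f_m >= 0.49:
            return "VOICED TRANSITION"
        return "UNVOICED"
    # previous frame UNVOICED or UNVOICED TRANSITION
    if f_m > 0.63:
        return "ONSET"
    if f_m > 0.585:
        return "UNVOICED TRANSITION"
    return "UNVOICED"
```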

In the case of a source-controlled variable bit rate (VBR) encoder, signal classification is inherent to the codec operation. The codec operates at several bit rates, and a rate selection module is used to determine the bit rate used for encoding each speech frame based on the nature of the speech frame (e.g. voiced, unvoiced, transient, and background noise frames are each encoded with a special encoding algorithm). The information about the coding mode, and thus about the speech class, is already an implicit part of the bitstream and need not be explicitly transmitted for FER processing. This class information can then be used to overwrite the classification decision described above.

In the example application to the AMR-WB codec, the only source-controlled rate selection is the voice activity detection (VAD). The VAD flag equals 1 for active speech and 0 for silence. This parameter is useful for the classification as it directly indicates that no further classification is needed if its value is 0 (i.e. the frame is directly classified as UNVOICED). This parameter is the output of the voice activity detection (VAD) module 402. Different VAD algorithms exist in the literature and any algorithm can be used for the purpose of the present invention. For instance, the VAD algorithm that is part of standard G.722.2 can be used [ITU-T Recommendation G.722.2 "Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB)", Geneva, 2002]. Here, the VAD algorithm is based on the output of the spectral analysis of module 500 (based on the signal-to-noise ratio per critical band). The VAD used for classification purposes differs from the one used for encoding purposes with respect to the hangover. In speech encoders using comfort noise generation (CNG) for segments without active speech (silence or noise only), a hangover is often added after speech spurts (the CNG in the AMR-WB standard is an example [3GPP TS 26.192, "AMR Wideband Speech Codec: Comfort Noise Aspects," 3GPP Technical Specification]). During the hangover, the speech encoder continues to be used and the system switches to the CNG only after the hangover period is over. For the purpose of classification for FER concealment, this high security is not needed. Consequently, the VAD flag for the classification will also equal 0 during the hangover period.

In this illustrative embodiment, the classification is performed in module 505 based on the parameters described above; namely, the normalized correlation (or voicing information) r̄_x, the spectral tilt ē_t, snr, the pitch stability counter pc, the relative frame energy E_s, the zero-crossing rate zc, and the VAD flag.

Classification at the Decoder

If the application does not permit the transmission of the class information (no extra bits can be transported), the classification can still be performed at the decoder. As already noted, the main disadvantage here is that there is generally no look-ahead available in speech decoders. Also, there is often a need to keep the decoder complexity limited.

A simple classification can be done by estimating the voicing of the synthesized signal. Considering the case of a CELP type encoder, the voicing estimate r_v computed as in Equation (1) can be used. That is:

$r_{v} = \frac{E_{v} - E_{c}}{E_{v} + E_{c}}$

where E_v is the energy of the scaled pitch codevector bv_T and E_c is the energy of the scaled innovative codevector gc_k. Theoretically, for a purely voiced signal r_v = 1 and for a purely unvoiced signal r_v = −1. The actual classification is done by averaging the r_v values over every four (4) subframes. The resulting factor f_rv (the average of the r_v values of the last four subframes) is used as follows:

TABLE 4
Signal classification rules at the decoder

Previous frame class                Rule                    Current frame class
ONSET, VOICED, VOICED TRANSITION    f_rv > −0.1             VOICED
                                    −0.1 ≧ f_rv ≧ −0.5      VOICED TRANSITION
                                    f_rv < −0.5             UNVOICED
UNVOICED TRANSITION, UNVOICED       f_rv > −0.1             ONSET
                                    −0.1 ≧ f_rv ≧ −0.5      UNVOICED TRANSITION
                                    f_rv < −0.5             UNVOICED
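A minimal Python sketch of these decoder-side rules, assuming the r_v values of the last four subframes are available; the function and argument names are illustrative.

```python
def classify_at_decoder(r_v_subframes, prev_class):
    """Sketch of the decoder-side rules of Table 4.

    r_v_subframes -- voicing estimates r_v of the last four subframes
    prev_class    -- class of the previous frame (string label)
    """
    f_rv = sum(r_v_subframes) / len(r_v_subframes)  # average voicing factor
    if prev_class in ("ONSET", "VOICED", "VOICED TRANSITION"):
        if f_rv > -0.1:
            return "VOICED"
        if f_rv >= -0.5:
            return "VOICED TRANSITION"
        return "UNVOICED"
    # previous frame UNVOICED or UNVOICED TRANSITION
    if f_rv > -0.1:
        return "ONSET"
    if f_rv >= -0.5:
        return "UNVOICED TRANSITION"
    return "UNVOICED"
```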

Similarly to the classification at the encoder, other parameters can be used at the decoder to help the classification, such as the parameters of the LP filter or the pitch stability.

In the case of a source-controlled variable bit rate coder, the information about the coding mode is already part of the bitstream. Hence, if for example a purely unvoiced coding mode is used, the frame can be automatically classified as UNVOICED. Similarly, if a purely voiced coding mode is used, the frame is classified as VOICED.

Speech Parameters for FER Processing

There are a few critical parameters that must be carefully controlled to avoid annoying artifacts when FERs occur. If a few extra bits can be transmitted, these parameters can be estimated at the encoder, quantized, and transmitted. Otherwise, some of them can be estimated at the decoder. These parameters include signal classification, energy information, phase information, and voicing information. The most important is a precise control of the speech energy. The phase and the speech periodicity can be controlled too, further improving the FER concealment and recovery.

The importance of the energy control manifests itself mainly when normal operation resumes after an erased block of frames. As most speech encoders make use of prediction, the right energy cannot be properly estimated at the decoder. In voiced speech segments, the incorrect energy can persist for several consecutive frames, which is very annoying, especially when this incorrect energy increases.

Even if the energy control is most important for voiced speech because of the long-term prediction (pitch prediction), it is also important for unvoiced speech. The reason here is the prediction used in the innovation gain quantizer, common in CELP type coders. A wrong energy during unvoiced segments can cause an annoying high-frequency fluctuation.

The phase control can be done in several ways, mainly depending on the available bandwidth. In our implementation, a simple phase control is achieved during lost voiced onsets by searching for approximate information about the glottal pulse position.

Hence, apart from the signal classification information discussed in the previous section, the most important information to send is the information about the signal energy and about the position of the first glottal pulse in a frame (phase information). If enough bandwidth is available, voicing information can be sent, too.

Energy Information

The energy information can be estimated and sent either in the LP residual domain or in the speech signal domain. Sending the information in the residual domain has the disadvantage of not taking into account the influence of the LP synthesis filter. This can be particularly tricky in the case of voiced recovery after several lost voiced frames (when the FER happens during a voiced speech segment). When a FER arrives after a voiced frame, the excitation of the last good frame is typically used during the concealment with some attenuation strategy. When a new LP synthesis filter arrives with the first good frame after the erasure, there can be a mismatch between the excitation energy and the gain of the LP synthesis filter. The new synthesis filter can produce a synthesis signal whose energy differs strongly from the energy of the last synthesized erased frame and also from the original signal energy. For this reason, the energy is computed and quantized in the signal domain.

The energy E_q is computed and quantized in the energy estimation and quantization module 506. It has been found that 6 bits are sufficient to transmit the energy. However, the number of bits can be reduced without significant effect if not enough bits are available. In this preferred embodiment, a 6-bit uniform quantizer is used in the range of −15 dB to 83 dB with a step of 1.58 dB. The quantization index is given by the integer part of:

$\begin{matrix}{i = \frac{{10\;{\log_{10}\left( {E + 0.001} \right)}} + 15}{1.58}} & (15)\end{matrix}$

where E is the maximum of the signal energy for frames classified as VOICED or ONSET, or the average energy per sample for other frames. For VOICED or ONSET frames, the maximum of the signal energy is computed pitch synchronously at the end of the frame as follows:

$\begin{matrix}{E = {\max\limits_{i = {L - t_{E}}}^{L - 1}\left( {s^{2}(i)} \right)}} & (16)\end{matrix}$

where L is the frame length and the signal s(i) stands for the speech signal (or the denoised speech signal if noise suppression is used). In this illustrative embodiment, s(i) stands for the input signal after downsampling to 12.8 kHz and pre-processing. If the pitch delay is greater than 63 samples, t_E equals the rounded closed-loop pitch lag of the last subframe. If the pitch delay is shorter than 64 samples, then t_E is set to twice the rounded closed-loop pitch lag of the last subframe.

For the other classes, E is the average energy per sample of the second half of the current frame, i.e. t_E is set to L/2 and E is computed as:

$\begin{matrix}{E = {\frac{1}{t_{E}}\;{\sum\limits_{i = {L - t_{E}}}^{L - 1}{s^{2}(i)}}}} & (17)\end{matrix}$
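The following Python sketch ties Equations (15) to (17) together; the frame length of 256 samples (20 ms at 12.8 kHz) and the function and argument names are assumptions made for illustration only.

```python
import numpy as np

def energy_index(s, frame_class, pitch_lag, L=256):
    """Sketch of Equations (15)-(17): 6-bit energy information parameter.

    s          -- pre-processed speech of the current frame (numpy array, 12.8 kHz)
    pitch_lag  -- rounded closed-loop pitch lag of the last subframe
    L          -- frame length in samples (assumed 256)
    """
    if frame_class in ("VOICED", "ONSET"):
        # pitch-synchronous maximum at the end of the frame (Eq. 16)
        t_E = pitch_lag if pitch_lag > 63 else 2 * pitch_lag
        E = np.max(s[L - t_E:L] ** 2)
    else:
        # average energy per sample of the second half of the frame (Eq. 17)
        t_E = L // 2
        E = np.mean(s[L - t_E:L] ** 2)
    # 6-bit uniform quantizer, -15 dB to 83 dB, step 1.58 dB (Eq. 15)
    i = int((10.0 * np.log10(E + 0.001) + 15.0) / 1.58)
    return min(63, max(0, i))
```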
Phase Control Information
The phase control is particularly important when recovering after a lost segment of voiced speech, for reasons similar to those described in the previous section. After a block of erased frames, the decoder memories become desynchronized from the encoder memories. To resynchronize the decoder, some phase information can be sent, depending on the available bandwidth. In the described illustrative implementation, a rough position of the first glottal pulse in the frame is sent. This information is then used for the recovery after lost voiced onsets, as will be described later.

Let T₀ be the rounded closed-loop pitch lag for the first subframe. The first glottal pulse search and quantization module 507 searches for the position of the first glottal pulse τ among the first T₀ samples of the frame by looking for the sample with the maximum amplitude. Best results are obtained when the position of the first glottal pulse is measured on the low-pass filtered residual signal.

The position of the first glottal pulse is coded using 6 bits in the following manner. The precision used to encode the position of the first glottal pulse depends on the closed-loop pitch value T₀ for the first subframe. This is possible because this value is known by both the encoder and the decoder, and is not subject to error propagation after one or several frame losses. When T₀ is less than 64, the position of the first glottal pulse relative to the beginning of the frame is encoded directly with a precision of one sample. When 64 ≦ T₀ < 128, the position of the first glottal pulse relative to the beginning of the frame is encoded with a precision of two samples by using simple integer division, i.e. τ/2. When T₀ ≧ 128, the position of the first glottal pulse relative to the beginning of the frame is encoded with a precision of four samples by further dividing τ by 2. The inverse procedure is done at the decoder. If T₀ < 64, the received quantized position is used as is. If 64 ≦ T₀ < 128, the received quantized position is multiplied by 2 and incremented by 1. If T₀ ≧ 128, the received quantized position is multiplied by 4 and incremented by 2 (incrementing by 2 results in a uniformly distributed quantization error).
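This encode/decode rule can be sketched as follows; the function names are illustrative, not part of the specification.

```python
def encode_pulse_position(tau, T0):
    """Quantize the first glottal pulse position tau (0 <= tau < T0)."""
    if T0 < 64:
        return tau        # 1-sample precision
    if T0 < 128:
        return tau // 2   # 2-sample precision
    return tau // 4       # 4-sample precision

def decode_pulse_position(q, T0):
    """Inverse mapping performed at the decoder."""
    if T0 < 64:
        return q
    if T0 < 128:
        return 2 * q + 1  # +1 centers the reconstruction
    return 4 * q + 2      # +2 gives a uniformly distributed error
```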

According to another embodiment of the invention, where the shape of the first glottal pulse is encoded, the position of the first glottal pulse is determined by a correlation analysis between the residual signal and the possible pulse shapes, signs (positive or negative) and positions. The pulse shape can be taken from a codebook of pulse shapes known at both the encoder and the decoder, this method being known as vector quantization by those of ordinary skill in the art. The shape, sign and amplitude of the first glottal pulse are then encoded and transmitted to the decoder.

Periodicity Information

If there is enough bandwidth, periodicity information, or voicing information, can be computed, transmitted, and used at the decoder to improve the frame erasure concealment. The voicing information is estimated based on the normalized correlation. It can be encoded quite precisely with 4 bits; however, 3 or even 2 bits would suffice if necessary. The voicing information is in general necessary only for frames with some periodic components, and better voicing resolution is needed for highly voiced frames. The normalized correlation is given in Equation (2) and is used as an indicator of the voicing information. It is quantized in the first glottal pulse search and quantization module 507. In this illustrative embodiment, a piece-wise linear quantizer has been used to encode the voicing information as follows:

$\begin{matrix}\begin{matrix}{{i = {\frac{{r_{x}(2)} - 0.65}{0.03} + 0.5}},} & {{for}\;{r_{x}(2)} < 0.92}\end{matrix} & (18) \\\begin{matrix}{{i = {9 + \frac{{r_{x}(2)} - 0.92}{0.01} + 0.5}},} & {{for}\;{r_{x}(2)} \geq 0.92}\end{matrix} & (19)\end{matrix}$

Again, the integer part of i is encoded and transmitted. The correlation r_x(2) has the same meaning as in Equation (1). In Equation (18) the voicing is linearly quantized between 0.65 and 0.89 with a step of 0.03. In Equation (19) the voicing is linearly quantized between 0.92 and 0.98 with a step of 0.01.

If a larger quantization range is needed, the following linear quantization can be used:

$\begin{matrix}{i = {\frac{{\overset{\_}{r}}_{x} - 0.4}{0.04} + 0.5}} & (20)\end{matrix}$

This equation quantizes the voicing in the range of 0.4 to 1 with a step of 0.04. The correlation r̄_x is defined in Equation (2a).
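A minimal sketch of the two voicing quantizers of Equations (18) to (20); taking int() of the result keeps the integer part, as required by the text. The function names are assumptions.

```python
def voicing_index(rx2):
    """Piece-wise linear quantizer of Equations (18)-(19)."""
    if rx2 < 0.92:
        i = (rx2 - 0.65) / 0.03 + 0.5        # Eq. (18): 0.65..0.89, step 0.03
    else:
        i = 9 + (rx2 - 0.92) / 0.01 + 0.5    # Eq. (19): 0.92..0.98, step 0.01
    return int(i)

def voicing_index_wide(rx_mean):
    """Alternative wider-range quantizer of Equation (20): 0.4..1, step 0.04."""
    return int((rx_mean - 0.4) / 0.04 + 0.5)
```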

Equations (18) and (19), or Equation (20), are then used in the decoder to recover r_x(2) or r̄_x. Let us call this quantized normalized correlation r_q. If the voicing cannot be transmitted, it can be estimated using the voicing factor from Equation (2a) by mapping it into the range from 0 to 1:

$\begin{matrix}{r_{q} = {0.5 \cdot \left( {f + 1} \right)}} & (21)\end{matrix}$

Processing of Erased Frames

The FER concealment techniques in this illustrative embodiment are demonstrated on ACELP type encoders. They can, however, be easily applied to any speech codec where the synthesis signal is generated by filtering an excitation signal through an LP synthesis filter. The concealment strategy can be summarized as a convergence of the signal energy and the spectral envelope to the estimated parameters of the background noise. The periodicity of the signal converges to zero. The speed of the convergence depends on the class of the last good received frame and on the number of consecutive erased frames, and is controlled by an attenuation factor α. The factor α further depends on the stability of the LP filter for UNVOICED frames. In general, the convergence is slow if the last good received frame is in a stable segment and rapid if the frame is in a transition segment. The values of α are summarized in Table 5.

TABLE 5
Values of the FER concealment attenuation factor α

Last good received frame    Number of successive erased frames    α
ARTIFICIAL ONSET                                                  0.6
ONSET, VOICED               ≦3                                    1.0
                            >3                                    0.4
VOICED TRANSITION                                                 0.4
UNVOICED TRANSITION                                               0.8
UNVOICED                    =1                                    0.6·θ + 0.4
                            >1                                    0.4

A stability factor θ is computed based on a distance measure between the adjacent LP filters. Here, the factor θ is related to the ISF (Immittance Spectral Frequencies) distance measure and is bounded by 0 ≦ θ ≦ 1, with larger values of θ corresponding to more stable signals. This results in decreased energy and spectral envelope fluctuations when an isolated frame erasure occurs inside a stable unvoiced segment.
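Table 5 amounts to a simple lookup; a Python sketch, with class labels as strings and illustrative names, could read:

```python
def attenuation_factor(last_good_class, n_erased, theta):
    """Sketch of Table 5: attenuation factor alpha.

    last_good_class -- class of the last good received frame
    n_erased        -- number of successive erased frames
    theta           -- LP filter stability factor, 0 <= theta <= 1
    """
    if last_good_class == "ARTIFICIAL ONSET":
        return 0.6
    if last_good_class in ("ONSET", "VOICED"):
        return 1.0 if n_erased <= 3 else 0.4
    if last_good_class == "VOICED TRANSITION":
        return 0.4
    if last_good_class == "UNVOICED TRANSITION":
        return 0.8
    # UNVOICED: stability-dependent for the first erased frame
    return 0.6 * theta + 0.4 if n_erased == 1 else 0.4
```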

The signal class remains unchanged during the processing of erased frames, i.e. the class remains the same as in the last good received frame.

Construction of the Periodic Part of the Excitation

For concealment of erased frames following a correctly received UNVOICED frame, no periodic part of the excitation signal is generated. For concealment of erased frames following a correctly received frame other than UNVOICED, the periodic part of the excitation signal is constructed by repeating the last pitch period of the previous frame. In the case of the first erased frame after a good frame, this pitch pulse is first low-pass filtered. The filter used is a simple 3-tap linear phase FIR filter with coefficients equal to 0.18, 0.64 and 0.18. If voicing information is available, the filter can also be selected dynamically, with a cut-off frequency dependent on the voicing.

The pitch period T_c used to select the last pitch pulse, and hence used during the concealment, is defined so that pitch multiples or submultiples can be avoided, or at least reduced. The following logic is used in determining the pitch period T_c:

if ((T₃ < 1.8 T_s) AND (T₃ > 0.6 T_s)) OR (T_cnt ≧ 30), then T_c = T₃, else T_c = T_s.

Here, T₃ is the rounded pitch period of the 4th subframe of the last good received frame and T_s is the rounded pitch period of the 4th subframe of the last good stable voiced frame with coherent pitch estimates. A stable voiced frame is defined here as a VOICED frame preceded by a frame of voiced type (VOICED TRANSITION, VOICED, ONSET). The coherence of the pitch is verified in this implementation by examining whether the closed-loop pitch estimates are reasonably close, i.e. whether the ratios between the last subframe pitch, the 2nd subframe pitch, and the last subframe pitch of the previous frame are within the interval (0.7, 1.4).

This determination of the pitch period T_c means that if the pitch at the end of the last good frame and the pitch of the last stable frame are close to each other, the pitch of the last good frame is used. Otherwise this pitch is considered unreliable and the pitch of the last stable frame is used instead, to avoid the impact of wrong pitch estimates at voiced onsets. This logic makes sense, however, only if the last stable segment is not too far in the past. Hence a counter T_cnt is defined that limits the reach of the influence of the last stable segment. If T_cnt is greater than or equal to 30, i.e. if there have been at least 30 frames since the last T_s update, the pitch of the last good frame is used systematically. T_cnt is reset to 0 every time a stable segment is detected and T_s is updated. The period T_c is then kept constant during the concealment for the whole erased block.
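The selection logic fits in a few lines; the following sketch assumes T₃, T_s and T_cnt are maintained as described above, with illustrative names.

```python
def concealment_pitch(T3, Ts, T_cnt):
    """Pitch period T_c used to repeat the last pitch pulse.

    T3    -- rounded pitch of the 4th subframe of the last good frame
    Ts    -- rounded pitch of the 4th subframe of the last stable voiced frame
    T_cnt -- frames elapsed since the last update of Ts
    """
    if (0.6 * Ts < T3 < 1.8 * Ts) or T_cnt >= 30:
        return T3   # last good frame pitch is considered reliable
    return Ts       # fall back to the last stable voiced pitch
```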

As the last pulse of the excitation of the previous frame is used for the construction of the periodic part, its gain is approximately correct at the beginning of the concealed frame and can be set to 1. The gain is then attenuated linearly throughout the frame on a sample-by-sample basis to achieve the value of α at the end of the frame.

The values of α correspond to Table 5, with the exception that they are modified for erasures following VOICED and ONSET frames, to take into consideration the energy evolution of voiced segments. This evolution can be extrapolated to some extent by using the pitch excitation gain values of each subframe of the last good frame. In general, if these gains are greater than 1, the signal energy is increasing; if they are lower than 1, the energy is decreasing. α is thus multiplied by a correction factor f_b computed as follows:

$\begin{matrix}{f_{b} = \sqrt{{0.1b(0)} + {0.2b(1)} + {0.3b(2)} + {0.4b(3)}}} & (23)\end{matrix}$

where b(0), b(1), b(2) and b(3) are the pitch gains of the four subframes of the last correctly received frame. The value of f_b is clipped between 0.85 and 0.98 before being used to scale the periodic part of the excitation. In this way, strong energy increases and decreases are avoided.
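A sketch of the resulting gain profile of the periodic part for erasures following VOICED or ONSET frames, combining the linear sample-by-sample attenuation with the correction factor f_b of Equation (23); the frame length and names are assumptions of this sketch.

```python
import math

def periodic_gain_profile(alpha, pitch_gains, L=256):
    """Per-sample gain applied to the repeated pitch excitation.

    alpha       -- attenuation factor from Table 5
    pitch_gains -- pitch gains b(0)..b(3) of the last good frame
    L           -- frame length in samples (assumed 256)
    """
    b = pitch_gains
    f_b = math.sqrt(0.1 * b[0] + 0.2 * b[1] + 0.3 * b[2] + 0.4 * b[3])  # Eq. (23)
    f_b = min(0.98, max(0.85, f_b))      # clip to avoid strong energy changes
    a_end = alpha * f_b                  # target gain at the end of the frame
    # linear attenuation from 1.0 at the first sample down to a_end at the last
    return [1.0 + (a_end - 1.0) * i / (L - 1) for i in range(L)]
```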

For erased frames following a correctly received frame other than UNVOICED, the excitation buffer is updated with this periodic part of the excitation only. This update will be used to construct the pitch codebook excitation in the next frame.

Construction of the Random Part of the Excitation

The innovation (non-periodic) part of the excitation signal is generated randomly. It can be generated as random noise or by using the CELP innovation codebook with randomly generated vector indexes. In the present illustrative embodiment, a simple random generator with an approximately uniform distribution has been used. Before adjusting the innovation gain, the randomly generated innovation is scaled to a reference value, fixed here to unit energy per sample.

At the beginning of an erased block, the innovation gain g_s is initialized by using the innovation excitation gains of each subframe of the last good frame:

$\begin{matrix}{g_{s} = {{0.1g(0)} + {0.2g(1)} + {0.3g(2)} + {0.4g(3)}}} & (23a)\end{matrix}$

where g(0), g(1), g(2) and g(3) are the fixed codebook, or innovation, gains of the four (4) subframes of the last correctly received frame. The attenuation strategy of the random part of the excitation is somewhat different from the attenuation of the pitch excitation. The reason is that the pitch excitation (and thus the excitation periodicity) converges to 0, while the random excitation converges to the comfort noise generation (CNG) excitation energy. The innovation gain attenuation is done as:

$\begin{matrix}{g_{s}^{1} = {{\alpha \cdot g_{s}^{0}} + {\left( {1 - \alpha} \right) \cdot g_{n}}}} & (24)\end{matrix}$

where g_s¹ is the innovation gain at the beginning of the next frame, g_s⁰ is the innovation gain at the beginning of the current frame, g_n is the gain of the excitation used during the comfort noise generation, and α is as defined in Table 5. Similarly to the periodic excitation attenuation, the gain is attenuated linearly throughout the frame on a sample-by-sample basis, starting with g_s⁰ and going to the value of g_s¹ that would be achieved at the beginning of the next frame.
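Equations (23a) and (24), together with the linear per-sample interpolation, could be sketched as follows; names and the frame length are assumptions.

```python
def innovation_gain_profile(g, alpha, g_n, L=256):
    """Sketch of Equations (23a)-(24): random-part gain attenuation.

    g     -- innovation gains g(0)..g(3) of the last good frame
    alpha -- attenuation factor from Table 5
    g_n   -- comfort noise generation (CNG) excitation gain
    """
    g_s0 = 0.1 * g[0] + 0.2 * g[1] + 0.3 * g[2] + 0.4 * g[3]  # Eq. (23a)
    g_s1 = alpha * g_s0 + (1.0 - alpha) * g_n                  # Eq. (24)
    # linear interpolation: g_s1 would be reached at the start of the next frame
    return [g_s0 + (g_s1 - g_s0) * i / L for i in range(L)]
```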

Finally, if the last good (correctly received or non-erased) frame is different from UNVOICED, the innovation excitation is filtered through a linear phase FIR high-pass filter with coefficients −0.0125, −0.109, 0.7813, −0.109, −0.0125. To decrease the amount of noisy components during voiced segments, these filter coefficients are multiplied by an adaptive factor equal to (0.75 − 0.25 r_v), r_v being the voicing factor as defined in Equation (1). The random part of the excitation is then added to the adaptive excitation to form the total excitation signal.

If the last good frame is UNVOICED, only the innovation excitation is used and it is further attenuated by a factor of 0.8. In this case, the past excitation buffer is updated with the innovation excitation, as no periodic part of the excitation is available.

Spectral Envelope Concealment, Synthesis and Updates

To synthesize the decoded speech, the LP filter parameters must be obtained. The spectral envelope is gradually moved toward the estimated envelope of the ambient noise. Here, the ISF representation of the LP parameters is used:

$\begin{matrix}{{l^{1}(j)} = {{\alpha \cdot l^{0}(j)} + {\left( {1 - \alpha} \right) \cdot l_{n}(j)}},\;{j = 0,\ldots,{p - 1}}} & (25)\end{matrix}$

In Equation (25), l¹(j) is the value of the j-th ISF of the current frame, l⁰(j) is the value of the j-th ISF of the previous frame, l_n(j) is the value of the j-th ISF of the estimated comfort noise envelope, and p is the order of the LP filter.
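Equation (25) in code form; a minimal sketch assuming the ISF vectors are plain sequences, with illustrative names.

```python
def conceal_isf(isf_prev, isf_cng, alpha):
    """Equation (25): move the spectral envelope toward the comfort noise one.

    isf_prev -- ISF vector of the previous frame, l0(j)
    isf_cng  -- ISF vector of the estimated comfort noise envelope, ln(j)
    alpha    -- attenuation factor from Table 5
    """
    return [alpha * l0 + (1.0 - alpha) * ln
            for l0, ln in zip(isf_prev, isf_cng)]
```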

The synthesized speech is obtained by filtering the excitation signal through the LP synthesis filter. The filter coefficients are computed from the ISF representation and are interpolated for each subframe (four (4) times per frame), as during normal encoder operation.

As the innovation gain quantizer and the ISF quantizer both use prediction, their memories will not be up to date after normal operation resumes. To reduce this effect, the quantizers' memories are estimated and updated at the end of each erased frame.

Recovery of the Normal Operation After Erasure

The problem of recovery after an erased block of frames is basically due to the strong prediction used in practically all modern speech encoders. In particular, CELP type speech coders achieve their high signal-to-noise ratio for voiced speech because they use the past excitation signal to encode the present frame excitation (long-term or pitch prediction). Also, most of the quantizers (LP quantizers, gain quantizers) make use of prediction.

Artificial Onset Construction

The most complicated situation related to the use of the long-term prediction in CELP encoders is when a voiced onset is lost. A lost onset means that the voiced speech onset happened somewhere during the erased block. In this case, the last good received frame was unvoiced and thus no periodic excitation is found in the excitation buffer. The first good frame after the erased block is, however, voiced; the excitation buffer at the encoder is highly periodic, and the adaptive excitation has been encoded using this periodic past excitation. As this periodic part of the excitation is completely missing at the decoder, it can take up to several frames to recover from this loss.

If an ONSET frame is lost (i.e. a VOICED good frame arrives after an erasure, but the last good frame before the erasure was UNVOICED, as shown in FIG. 6), a special technique is used to artificially reconstruct the lost onset and to trigger the voiced synthesis. At the beginning of the first good frame after a lost onset, the periodic part of the excitation is constructed artificially as a low-pass filtered periodic train of pulses separated by a pitch period. In the present illustrative embodiment, the low-pass filter is a simple linear phase FIR filter with the impulse response h_low = {−0.0125, 0.109, 0.7813, 0.109, −0.0125}. However, the filter could also be selected dynamically, with a cut-off frequency corresponding to the voicing information, if this information is available. The innovative part of the excitation is constructed using normal CELP decoding. The entries of the innovation codebook could also be chosen randomly (or the innovation itself could be generated randomly), as the synchrony with the original signal has been lost anyway.

In practice, the length of the artificial onset is limited so that at least one entire pitch period is constructed by this method, and the method is continued to the end of the current subframe. After that, regular ACELP processing is resumed. The pitch period considered is the rounded average of the decoded pitch periods of all subframes where the artificial onset reconstruction is used. The low-pass filtered impulse train is realized by placing the impulse responses of the low-pass filter in the adaptive excitation buffer (previously initialized to zero). The first impulse response is centered at the quantized position τ_q (transmitted within the bitstream) with respect to the frame beginning, and the remaining impulse responses are placed with a distance of the averaged pitch up to the end of the last subframe affected by the artificial onset construction. If the available bandwidth is not sufficient to transmit the first glottal pulse position, the first impulse response can be placed arbitrarily around half of the pitch period after the current frame beginning.

As an example, for a subframe length of 64 samples, let the pitch periods in the first and the second subframes be p(0) = 70.75 and p(1) = 71. Since this is larger than the subframe size of 64, the artificial onset will be constructed during the first two subframes and the pitch period will be equal to the average pitch of the two subframes rounded to the nearest integer, i.e. 71. The last two subframes will be processed by the normal CELP decoder.
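The construction of the low-pass filtered impulse train could be sketched as follows; buffer handling is simplified and the function signature is an assumption of this sketch, not the specification's interface.

```python
import numpy as np

H_LOW = np.array([-0.0125, 0.109, 0.7813, 0.109, -0.0125])  # low-pass FIR

def artificial_onset(excitation, tau_q, pitch, end_sample):
    """Place low-pass filtered pulses in the (zeroed) adaptive excitation buffer.

    excitation -- numpy array holding the frame's adaptive excitation
    tau_q      -- quantized position of the first glottal pulse
    pitch      -- rounded average pitch over the affected subframes
    end_sample -- end of the last subframe affected by the construction
    """
    excitation[:end_sample] = 0.0
    half = len(H_LOW) // 2
    pos = tau_q
    while pos < end_sample:
        # center one impulse response of the low-pass filter at pos
        for k, h in enumerate(H_LOW):
            idx = pos - half + k
            if 0 <= idx < end_sample:
                excitation[idx] += h
        pos += pitch        # next pulse one averaged pitch period later
    return excitation
```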

The energy of the periodic part of the artificial onset excitation is then scaled by the gain corresponding to the quantized and transmitted energy for FER concealment (as defined in Equations (16) and (17)) and divided by the gain of the LP synthesis filter. The LP synthesis filter gain is computed as:

$\begin{matrix}{g_{LP} = \sqrt{\sum\limits_{i = 0}^{63}{h^{2}(i)}}} & (31)\end{matrix}$

where h(i) is the LP synthesis filter impulse response. Finally, the artificial onset gain is reduced by multiplying the periodic part by 0.96. Alternatively, this value could correspond to the voicing if there were bandwidth available to also transmit the voicing information. Alternatively, without departing from the essence of this invention, the artificial onset can also be constructed in the past excitation buffer before entering the decoder subframe loop. This would have the advantage of avoiding the special processing needed to construct the periodic part of the artificial onset, as the regular CELP decoding could be used instead.

The LP filter for the output speech synthesis is not interpolated in the case of an artificial onset construction. Instead, the received LP parameters are used for the synthesis of the whole frame.

Energy Control

The most important task at the recovery after an erased block of frames is to properly control the energy of the synthesized speech signal. The synthesis energy control is needed because of the strong prediction usually used in modern speech coders. The energy control is most important when a block of erased frames happens during a voiced segment. When a frame erasure arrives after a voiced frame, the excitation of the last good frame is typically used during the concealment with some attenuation strategy. When a new LP filter arrives with the first good frame after the erasure, there can be a mismatch between the excitation energy and the gain of the new LP synthesis filter. The new synthesis filter can produce a synthesis signal with an energy highly different from the energy of the last synthesized erased frame and also from the original signal energy.

The energy control during the first good frame after an erased frame can be summarized as follows. The synthesized signal is scaled so that its energy at the beginning of the first good frame is similar to the energy of the synthesized speech signal at the end of the last erased frame, and converges to the transmitted energy toward the end of the frame, while preventing too strong an energy increase.

The energy control is done in the synthesized speech signal domain. Even if the energy is controlled in the speech domain, the excitation signal must be scaled, as it serves as the long-term prediction memory for the following frames. The synthesis is then redone to smooth the transitions. Let g₀ denote the gain used to scale the first sample in the current frame and g₁ the gain used at the end of the frame. The excitation signal is then scaled as follows:

$\begin{matrix}{{u_{s}(i)} = {g_{AGC}(i) \cdot u(i)},\;{i = 0,\ldots,{L - 1}}} & (32)\end{matrix}$

where u_s(i) is the scaled excitation, u(i) is the excitation before the scaling, L is the frame length, and g_AGC(i) is the gain starting from g₀ and converging exponentially to g₁:

$g_{AGC}(i) = {{f_{AGC} \cdot g_{AGC}\left( {i - 1} \right)} + {\left( {1 - f_{AGC}} \right) \cdot g_{1}}},\;{i = 0,\ldots,{L - 1}}$

with the initialization g_AGC(−1) = g₀, where f_AGC is the attenuation factor, set in this implementation to the value 0.98. This value has been found experimentally as a compromise between having a smooth transition from the previous (erased) frame on the one hand, and scaling the last pitch period of the current frame as much as possible toward the correct (transmitted) value on the other hand. This is important because the transmitted energy value is estimated pitch synchronously at the end of the frame. The gains g₀ and g₁ are defined as:

$\begin{matrix}{g_{0} = \sqrt{E_{- 1}/E_{0}}} & (33a)\end{matrix}$

$\begin{matrix}{g_{1} = \sqrt{E_{q}/E_{1}}} & (33b)\end{matrix}$

where E₋₁ is the energy computed at the end of the previous (erased) frame, E₀ is the energy at the beginning of the current (recovered) frame, E₁ is the energy at the end of the current frame, and E_q is the quantized transmitted energy information at the end of the current frame, computed at the encoder from Equations (16) and (17). E₋₁ and E₁ are computed similarly, with the exception that they are computed on the synthesized speech signal s′. E₋₁ is computed pitch synchronously using the concealment pitch period T_c, and E₁ uses the last subframe rounded pitch T₃. E₀ is computed similarly, using the rounded pitch value T₀ of the first subframe, Equations (16) and (17) being modified to:

$E = {\max\limits_{i = 0}^{t_{E}}\left( {s^{\prime\, 2}(i)} \right)}$

for VOICED and ONSET frames. t_E equals the rounded pitch lag, or twice that length if the pitch is shorter than 64 samples. For the other frames,

$E = {\frac{1}{t_{E}}{\sum\limits_{i = 0}^{t_{E}}{s^{\prime\, 2}(i)}}}$

with t_E equal to half the frame length. The gains g₀ and g₁ are further limited to a maximum allowed value, to prevent a strong energy increase. This value has been set to 1.2 in the present illustrative implementation.
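The whole AGC scaling of Equations (32) to (33b), including the 1.2 limit on g₀ and g₁, could be sketched as follows; the energy values are assumed to be precomputed as described above, and the function name is illustrative.

```python
import math

def agc_scale(u, E_m1, E0, E1, E_q, f_agc=0.98, g_max=1.2):
    """Sketch of Equations (32)-(33): excitation scaling in the first good frame.

    u    -- excitation samples of the first good frame after the erasure
    E_m1 -- energy at the end of the last erased frame (on synthesized speech)
    E0   -- energy at the beginning of the current frame
    E1   -- energy at the end of the current frame
    E_q  -- quantized transmitted energy information
    """
    g0 = min(g_max, math.sqrt(E_m1 / E0))   # Eq. (33a), limited to 1.2
    g1 = min(g_max, math.sqrt(E_q / E1))    # Eq. (33b), limited to 1.2
    g = g0                                   # initialization g_AGC(-1) = g0
    out = []
    for x in u:
        g = f_agc * g + (1.0 - f_agc) * g1  # exponential convergence to g1
        out.append(g * x)
    return out
```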

If E_q cannot be transmitted, E_q is set to E₁. If, however, the erasure happens during a voiced speech segment (i.e. the last good frame before the erasure and the first good frame after the erasure are classified as VOICED TRANSITION, VOICED or ONSET), further precautions must be taken because of the possible mismatch between the excitation signal energy and the LP filter gain, mentioned previously. A particularly dangerous situation arises when the gain of the LP filter of a first non erased frame received following the frame erasure is higher than the gain of the LP filter of the last frame erased during that frame erasure. In that particular case, the energy of the LP filter excitation signal produced in the decoder during the received first non erased frame is adjusted to the gain of the LP filter of the received first non erased frame using the following relation:

$E_{q} = {E_{1}\frac{E_{{LP}\,0}}{E_{{LP}\,1}}}$

where E_LP0 is the energy of the LP filter impulse response of the last good frame before the erasure and E_LP1 is the energy of the LP filter impulse response of the first good frame after the erasure. In this implementation, the LP filters of the last subframes in a frame are used. Finally, the value of E_q is limited to the value of E₋₁ in this case (voiced segment erasure without E_q information being transmitted).

The following exceptions, all related to transitions in the speech signal, further overwrite the computation of g₀. If an artificial onset is used in the current frame, g₀ is set to 0.5 g₁, to make the onset energy increase gradually.

In the case of a first good frame after an erasure classified as ONSET, the gain g₀ is prevented from being higher than g₁. This precaution is taken to prevent a positive gain adjustment at the beginning of the frame (which is probably still at least partially unvoiced) from amplifying the voiced onset (at the end of the frame).

Finally, during a transition from voiced to unvoiced (i.e. the last good frame being classified as VOICED TRANSITION, VOICED or ONSET and the current frame being classified as UNVOICED), or during a transition from a non-active speech period to an active speech period (the last good received frame being encoded as comfort noise and the current frame being encoded as active speech), g₀ is set to g₁.

In the case of a voiced segment erasure, the wrong energy problem can also manifest itself in frames following the first good frame after the erasure. This can happen even if the energy of the first good frame has been adjusted as described above. To attenuate this problem, the energy control can be continued up to the end of the voiced segment.

Although the present invention has been described in the foregoing description in relation to an illustrative embodiment thereof, this illustrative embodiment can be modified at will, within the scope of the appended claims, without departing from the scope and spirit of the subject invention.

1. A method of concealing frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: determining, in the encoder, concealment/recovery parameters related to the sound signal; transmitting to the decoder concealment/recovery parameters determined in the encoder; and in the decoder, conducting frame erasure concealment and decoder recovery in response to the received concealment/recovery parameters; wherein: conducting frame erasure concealment and decoder recovery comprises, when at least one onset frame is lost, constructing a periodic excitation part artificially as a low-pass filtered periodic train of pulses separated by a pitch period; the method comprises quantizing a position of a first glottal pulse with respect to the beginning of the onset frame prior to transmission of said position of the first glottal pulse to the decoder; and constructing the periodic excitation part comprises realizing the low-pass filtered periodic train of pulses by: centering a first impulse response of a low-pass filter on the quantized position of the first glottal pulse with respect to the beginning of the onset frame; and placing remaining impulse responses of the low-pass filter each with a distance corresponding to an average pitch value from the preceding impulse response up to the end of a last subframe affected by the artificial construction of the periodic part.
2. A method of concealing frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: determining, in the encoder, concealment/recovery parameters selected from the group consisting of a signal classification parameter, an energy information parameter, and a phase information parameter related to the sound signal; transmitting to the decoder concealment/recovery parameters determined in the encoder; and in the decoder, conducting frame erasure concealment and decoder recovery in response to the received concealment/recovery parameters; wherein the concealment/recovery parameters include the phase information parameter and wherein determination of the phase information parameter comprises: determining a position of a first glottal pulse in a frame of the encoded sound signal; and encoding, in the encoder, a shape, sign and amplitude of the first glottal pulse and transmitting the encoded shape, sign and amplitude from the encoder to the decoder.
3. A method of concealing frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: determining, in the encoder, concealment/recovery parameters selected from the group consisting of a signal classification parameter, an energy information parameter, and a phase information parameter related to the sound signal; transmitting to the decoder concealment/recovery parameters determined in the encoder; and in the decoder, conducting frame erasure concealment and decoder recovery in response to the received concealment/recovery parameters; wherein: the concealment/recovery parameters include the phase information parameter; determination of the phase information parameter comprises determining a position of a first glottal pulse in a frame of the encoded sound signal; and determining the position of the first glottal pulse comprises: measuring a sample of maximum amplitude within a pitch period as the first glottal pulse; and quantizing a position of the sample of maximum amplitude within the pitch period.
4. A method of concealing frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: determining, in the encoder, concealment/recovery parameters selected from the group consisting of a signal classification parameter, an energy information parameter, and a phase information parameter related to the sound signal; transmitting to the decoder concealment/recovery parameters determined in the encoder; and in the decoder, conducting frame erasure concealment and decoder recovery in response to the received concealment/recovery parameters; wherein: the sound signal is a speech signal; determining, in the encoder, concealment/recovery parameters comprises classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and determining concealment/recovery parameters comprises calculating the energy information parameter in relation to a maximum of a signal energy for frames classified as voiced or onset, and calculating the energy information parameter in relation to an average energy per sample for other frames.
5. A method of concealing frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: determining, in the encoder, concealment/recovery parameters selected from the group consisting of a signal classification parameter, an energy information parameter, and a phase information parameter related to the sound signal; transmitting to the decoder concealment/recovery parameters determined in the encoder; and in the decoder, conducting frame erasure concealment and decoder recovery in response to the received concealment/recovery parameters; wherein conducting frame erasure concealment and decoder recovery comprises: controlling an energy of a synthesized sound signal produced by the decoder, controlling energy of the synthesized sound signal comprising scaling the synthesized sound signal to render an energy of said synthesized sound signal at the beginning of a first non erased frame received following frame erasure similar to an energy of said synthesized sound signal at the end of a last frame erased during said frame erasure; and converging the energy of the synthesized sound signal in the received first non erased frame to an energy corresponding to the received energy information parameter toward the end of said received first non erased frame while limiting an increase in energy.
6. A method as claimed in claim 5, wherein: the sound signal is a speech signal; determining, in the encoder, concealment/recovery parameters comprises classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and when the first non erased frame received after a frame erasure is classified as onset, conducting frame erasure concealment and decoder recovery comprises limiting to a given value a gain used for scaling the synthesized sound signal.
7. A method as claimed in claim 5, wherein: the sound signal is a speech signal; determining, in the encoder, concealment/recovery parameters comprises classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and said method comprising making a gain used for scaling the synthesized sound signal at the beginning of the first non erased frame received after frame erasure equal to a gain used at an end of said received first non erased frame: during a transition from a voiced frame to an unvoiced frame, in the case of a last non erased frame received before frame erasure classified as voiced transition, voiced or onset and a first non erased frame received after frame erasure classified as unvoiced; and during a transition from a non-active speech period to an active speech period, when the last non erased frame received before frame erasure is encoded as comfort noise and the first non erased frame received after frame erasure is encoded as active speech.
8. A method of concealing frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: determining, in the encoder, concealment/recovery parameters selected from the group consisting of a signal classification parameter, an energy information parameter, and a phase information parameter related to the sound signal; transmitting to the decoder concealment/recovery parameters determined in the encoder; and in the decoder, conducting frame erasure concealment and decoder recovery in response to the received concealment/recovery parameters; wherein: the energy information parameter is not transmitted from the encoder to the decoder; and conducting frame erasure concealment and decoder recovery comprises, when a gain of a LP filter of a first non erased frame received following frame erasure is higher than a gain of a LP filter of a last frame erased during said frame erasure, adjusting an energy of an LP filter excitation signal produced in the decoder during the received first non erased frame to the gain of the LP filter of said received first non erased frame.
9. A method as claimed in claim 8, wherein: adjusting the energy of the LP filter excitation signal produced in the decoder during the received first non erased frame to the gain of the LP filter of said received first non erased frame comprises using the following relation: $E_{q} = {E_{1}\frac{E_{{LP}\,0}}{E_{{LP}\,1}}}$ where E₁ is an energy at an end of the current frame, E_LP0 is an energy of an impulse response of the LP filter of a last non erased frame received before the frame erasure, and E_LP1 is an energy of an impulse response of the LP filter of the received first non erased frame following frame erasure.
10. A method of concealing frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: determining, in the encoder, concealment/recovery parameters selected from the group consisting of a signal classification parameter, an energy information parameter and a phase information parameter related to the sound signal; and transmitting to the decoder concealment/recovery parameters determined in the encoder; wherein the concealment/recovery parameters include the phase information parameter and wherein determination of the phase information parameter comprises: determining a position of a first glottal pulse in a frame of the encoded sound signal; and encoding, in the encoder, a shape, sign and amplitude of the first glottal pulse and transmitting the encoded shape, sign and amplitude from the encoder to the decoder.
11. A method of concealing frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: determining, in the encoder, concealment/recovery parameters selected from the group consisting of a signal classification parameter, an energy information parameter and a phase information parameter related to the sound signal; and transmitting to the decoder concealment/recovery parameters determined in the encoder; wherein: the concealment/recovery parameters include the phase information parameter; determination of the phase information parameter comprises determining a position of a first glottal pulse in a frame of the encoded sound signal; and determining the position of the first glottal pulse comprises: measuring a sample of maximum amplitude within a pitch period as the first glottal pulse; and quantizing a position of the sample of maximum amplitude within the pitch period.
12. A method for the concealment of frame erasure caused by frames erased during transmission of a sound signal encoded under the form of signal-encoding parameters from an encoder to a decoder, comprising: determining, in the decoder, concealment/recovery parameters from the signal-encoding parameters, wherein the concealment/recovery parameters are selected from the group consisting of a signal classification parameter, an energy information parameter and a phase information parameter related to the sound signal and are used for producing, upon occurrence of frame erasure, a replacement frame selected from the group consisting of a voiced frame, an unvoiced frame, and a frame defining a transition between voiced and unvoiced frames; and in the decoder, conducting frame erasure concealment and decoder recovery in response to concealment/recovery parameters determined in the decoder; wherein: the concealment/recovery parameters include the energy information parameter; the energy information parameter is not transmitted from the encoder to the decoder; and conducting frame erasure concealment and decoder recovery comprises, when a gain of a LP filter of a first non erased frame received following frame erasure is higher than a gain of a LP filter of a last frame erased during said frame erasure, adjusting an energy of an LP filter excitation signal produced in the decoder during the received first non erased frame to a gain of the LP filter of said received first non erased frame using the following relation: $E_{q} = {E_{1}\frac{E_{{LP}\,0}}{E_{{LP}\,1}}}$ where E₁ is an energy at an end of the current frame, E_LP0 is an energy of an impulse response of the LP filter of a last non erased frame received before the frame erasure, and E_LP1 is an energy of an impulse response of the LP filter of the received first non erased frame following frame erasure.
13. A device for conducting concealment of frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: in the encoder, a determiner of concealment/recovery parameters related to the sound signal; and a communication link for transmitting to the decoder concealment/recovery parameters determined in the encoder; wherein: the decoder conducts frame erasure concealment and decoder recovery in response to the concealment/recovery parameters received from the encoder; for conducting frame erasure concealment and decoder recovery, the decoder constructs, when at least one onset frame is lost, a periodic excitation part artificially as a low-pass filtered periodic train of pulses separated by a pitch period; the device comprises a quantizer of a position of a first glottal pulse with respect to the beginning of the onset frame prior to transmission of said position of the first glottal pulse to the decoder; and the decoder, for constructing the periodic excitation part, realizes the low-pass filtered periodic train of pulses by: centering a first impulse response of a low-pass filter on the quantized position of the first glottal pulse with respect to the beginning of the onset frame; and placing remaining impulse responses of the low-pass filter each with a distance corresponding to an average pitch value from the preceding impulse response up to an end of a last subframe affected by the artificial construction of the periodic part.
14. A device for conducting concealment of frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: in the encoder, a determiner of concealment/recovery parameters selected from the group consisting of a signal classification parameter, an energy information parameter and a phase information parameter related to the sound signal; and a communication link for transmitting to the decoder concealment/recovery parameters determined in the encoder; wherein: the decoder conducts frame erasure concealment and decoder recovery in response to the concealment/recovery parameters received from the encoder; the concealment/recovery parameters include the phase information parameter; to determine the phase information parameter, the determiner comprises a searcher of a position of a first glottal pulse in a frame of the encoded sound signal; the searcher encodes a shape, sign and amplitude of the first glottal pulse and the communication link transmits the encoded shape, sign and amplitude from the encoder to the decoder.
15. A device for conducting concealment of frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: in the encoder, a determiner of concealment/recovery parameters selected from the group consisting of a signal classification parameter, an energy information parameter and a phase information parameter related to the sound signal; and a communication link for transmitting to the decoder concealment/recovery parameters determined in the encoder; wherein: the decoder conducts frame erasure concealment and decoder recovery in response to the concealment/recovery parameters received from the encoder; the concealment/recovery parameters include the phase information parameter; to determine the phase information parameter, the determiner comprises a searcher of a position of a first glottal pulse in a frame of the encoded sound signal; and the searcher measures a sample of maximum amplitude within a pitch period as the first glottal pulse, and the determiner comprises a quantizer of the position of the sample of maximum amplitude within the pitch period.
16. A device for conducting concealment of frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: in the encoder, a determiner of concealment/recovery parameters selected from the group consisting of a signal classification parameter, an energy information parameter and a phase information parameter related to the sound signal; and a communication link for transmitting to the decoder concealment/recovery parameters determined in the encoder; wherein: the decoder conducts frame erasure concealment and decoder recovery in response to the concealment/recovery parameters received from the encoder; the sound signal is a speech signal; the determiner of concealment/recovery parameters comprises a classifier of successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and the determiner of concealment/recovery parameters comprises a computer of the energy information parameter in relation to a maximum of a signal energy for frames classified as voiced or onset, and in relation to an average energy per sample for other frames.
17. A device for conducting concealment of frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: in the encoder, a determiner of concealment/recovery parameters selected from the group consisting of a signal classification parameter, an energy information parameter and a phase information parameter related to the sound signal; and a communication link for transmitting to the decoder concealment/recovery parameters determined in the encoder; wherein: the decoder conducts frame erasure concealment and decoder recovery in response to concealment/recovery parameters received from the encoder; and for conducting frame erasure concealment and decoder recovery: the decoder controls an energy of a synthesized sound signal produced by the decoder by scaling the synthesized sound signal to render an energy of said synthesized sound signal at the beginning of a first non erased frame received following frame erasure similar to an energy of said synthesized sound signal at the end of a last frame erased during said frame erasure; and the decoder converges the energy of the synthesized sound signal in the received first non erased frame to an energy corresponding to the received energy information parameter toward the end of said received first non erased frame while limiting an increase in energy.
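One possible reading of the energy control in claim 17, sketched below, scales the synthesis with a gain interpolated linearly from a value matching the energy at the end of the last concealed frame to a value matching the received energy information, with the latter gain capped to limit any increase; the linear interpolation and the cap are assumptions of the sketch.

    #include <math.h>

    /* Claim 17 sketch: in the first non erased frame after an erasure,
     * scale the synthesis syn[0..n-1] (n >= 2) with a gain running from
     * g0 (so the frame starts at the energy the concealed synthesis
     * ended with) to g1 (the energy signalled by the received energy
     * information), with g1 capped by max_up_gain. */
    static void scale_first_good_frame(float *syn, int n,
                                       float e_start_target, float e_start_actual,
                                       float e_end_target, float e_end_actual,
                                       float max_up_gain)
    {
        float g0 = sqrtf(e_start_target / (e_start_actual + 1e-9f));
        float g1 = sqrtf(e_end_target / (e_end_actual + 1e-9f));
        if (g1 > max_up_gain)              /* limit the increase in energy */
            g1 = max_up_gain;
        for (int i = 0; i < n; i++) {
            float g = g0 + (g1 - g0) * (float)i / (float)(n - 1);
            syn[i] *= g;
        }
    }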
18. A device as claimed in claim 17, wherein: the sound signal is a speech signal; the determiner of concealment/recovery parameters comprises a classifier of successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and when the first non erased frame received following frame erasure is classified as onset, the decoder, for conducting frame erasure concealment and decoder recovery, limits to a given value a gain used for scaling the synthesized sound signal.
19. A device as claimed in claim 17, wherein: the sound signal is a speech signal; the determiner of concealment/recovery parameters comprises a classifier of successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and the decoder makes a gain used for scaling the synthesized sound signal at the beginning of the first non erased frame received after frame erasure equal to a gain used at an end of said received first non erased frame: during a transition from a voiced frame to an unvoiced frame, in the case of a last non erased frame received before frame erasure classified as voiced transition, voiced or onset and a first non erased frame received after frame erasure classified as unvoiced; and during a transition from a non-active speech period to an active speech period, when the last non erased frame received before frame erasure is encoded as comfort noise and the first non erased frame received after frame erasure is encoded as active speech.
20. A device for conducting concealment of frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: in the encoder, a determiner of concealment/recovery parameters selected from the group consisting of a signal classification parameter, an energy information parameter and a phase information parameter related to the sound signal; and a communication link for transmitting to the decoder concealment/recovery parameters determined in the encoder; wherein: the decoder conducts frame erasure concealment and decoder recovery in response to the concealment/recovery parameters received from the encoder; the energy information parameter is not transmitted from the encoder to the decoder; and when a gain of an LP filter of a first non erased frame received following frame erasure is higher than a gain of an LP filter of a last frame erased during said frame erasure, the decoder adjusts an energy of an LP filter excitation signal produced in the decoder during the received first non erased frame to a gain of the LP filter of said received first non erased frame.
21. A device as claimed in claim 20, wherein: the decoder, for adjusting the energy of the LP filter excitation signal produced in the decoder during the received first non erased frame to the gain of the LP filter of said received first non erased frame, uses the following relation: $E_q = E_1\frac{E_{LP0}}{E_{LP1}}$, where $E_1$ is an energy at an end of a current frame, $E_{LP0}$ is an energy of an impulse response of an LP filter of a last non erased frame received before the frame erasure, and $E_{LP1}$ is an energy of an impulse response of the LP filter of the received first non erased frame following frame erasure.

22. A device for conducting concealment of frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: in the encoder, a determiner of concealment/recovery parameters selected from the group consisting of a signal classification parameter, an energy information parameter and a phase information parameter related to the sound signal; and a communication link for transmitting to the decoder concealment/recovery parameters determined in the encoder; wherein: the concealment/recovery parameters include the phase information parameter; to determine the phase information parameter, the determiner comprises a searcher of a position of a first glottal pulse in a frame of the encoded sound signal; and the searcher encodes a shape, sign and amplitude of the first glottal pulse, and the communication link transmits the encoded shape, sign and amplitude from the encoder to the decoder.
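The relation of claims 12 and 21 can be evaluated directly once the impulse-response energies are available; the truncated impulse-response length and the small regularization constant in the sketch below are assumptions.

    /* Energy of a truncated LP-filter impulse response h[0..len-1];
     * the truncation length len is an illustrative assumption. */
    static float lp_impulse_energy(const float *h, int len)
    {
        float e = 0.0f;
        for (int i = 0; i < len; i++)
            e += h[i] * h[i];
        return e;
    }

    /* E_q = E_1 * E_LP0 / E_LP1 (claims 12 and 21): scale back the
     * excitation energy so the higher gain of the new LP filter does
     * not inflate the synthesis energy after the erasure. */
    static float adjusted_excitation_energy(float e1, float e_lp0, float e_lp1)
    {
        return e1 * (e_lp0 / (e_lp1 + 1e-9f));
    }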
23. A device for conducting concealment of frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: in the encoder, a determiner of concealment/recovery parameters selected from the group consisting of a signal classification parameter, an energy information parameter and a phase information parameter related to the sound signal; and a communication link for transmitting to the decoder concealment/recovery parameters determined in the encoder; wherein: the concealment/recovery parameters include the phase information parameter; to determine the phase information parameter, the determiner comprises a searcher of a position of a first glottal pulse in a frame of the encoded sound signal; the searcher measures a sample of maximum amplitude within a pitch period as the first glottal pulse; and the determiner comprises a quantizer of the position of the sample of maximum amplitude within the pitch period.
24. A device for conducting concealment of frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: in the encoder, a determiner of concealment/recovery parameters selected from the group consisting of a signal classification parameter, an energy information parameter and a phase information parameter related to the sound signal; and a communication link for transmitting to the decoder concealment/recovery parameters determined in the encoder; wherein: the sound signal is a speech signal; the determiner of concealment/recovery parameters comprises a classifier of successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and the determiner of concealment/recovery parameters comprises a computer of the energy information parameter in relation to a maximum of a signal energy for frames classified as voiced or onset, and in relation to an average energy per sample for other frames.
25. A device for the concealment of frame erasure caused by frames erased during transmission of a sound signal encoded under the form of signal-encoding parameters from an encoder to a decoder, wherein: the decoder determines concealment/recovery parameters selected from the group consisting of a signal classification parameter, an energy information parameter and a phase information parameter related to the sound signal, for producing, upon occurrence of frame erasure, a replacement frame selected from the group consisting of a voiced frame, an unvoiced frame, and a frame defining a transition between voiced and unvoiced frames; and the decoder conducts erased frame concealment and decoder recovery in response to the determined concealment/recovery parameters; wherein: the concealment/recovery parameters include the energy information parameter; the energy information parameter is not transmitted from the encoder to the decoder; and the decoder, for conducting frame erasure concealment and decoder recovery when a gain of an LP filter of a first non erased frame received following frame erasure is higher than a gain of an LP filter of a last frame erased during said frame erasure, adjusts an energy of an LP filter excitation signal produced in the decoder during the received first non erased frame to a gain of the LP filter of said received first non erased frame using the following relation: $E_q = E_1\frac{E_{LP0}}{E_{LP1}}$, where $E_1$ is an energy at an end of a current frame, $E_{LP0}$ is an energy of an impulse response of an LP filter of a last non erased frame received before the frame erasure, and $E_{LP1}$ is an energy of an impulse response of the LP filter of the received first non erased frame following frame erasure.